Conversation

@davidberard98 (Contributor) commented Mar 19, 2024

Stack from ghstack (oldest at bottom):

**Motivation**: #112771

**Summary**: Inductor generates Triton code that assumes inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones them. This PR introduces a config option to turn that behavior off: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and it will not make clones. This can generate code that might be a bit slower, but the tradeoff can be worth it in scenarios where you would otherwise make a lot of clones.
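For concreteness, here is a minimal sketch (not from the PR itself) of how the flag could be toggled, assuming the `torch._inductor.config` entry added by this PR:

```python
import torch
import torch._inductor.config as inductor_config

# Optimize for potentially-unaligned inputs: Inductor will neither mark
# inputs as divisible_by_16 nor clone them to force 16-byte alignment.
inductor_config.assume_aligned_inputs = False

@torch.compile
def f(x):
    return x * 2 + 1

base = torch.randn(1025)
out = f(base[1:])  # float32 view at a 4-byte storage offset: not 16-byte aligned
```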

Ideally, we could do this on a per-tensor basis. But that would be a lot of work, and attempts to automatically add guards on storage offsets have run into issues: recompilations and excessive time to generate and check guards.

**Tests**: #122159 flips this to False. It didn't run through all the errors, but the ones we see are all expected failures: divisible_by_16 changes; Triton kernel caching fails when the same Triton kernel is called multiple times (this makes sense because the first call has unaligned inputs but subsequent calls have aligned inputs); and some xfailed tests start passing.
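As a hypothetical illustration (not taken from #122159) of the caching scenario above, the same compiled function can see an unaligned view on the first call and an aligned tensor of the same shape later:

```python
import torch

g = torch.compile(lambda x: x + 1)

buf = torch.randn(1025)
g(buf[1:])   # first call: input starts 4 bytes into storage (unaligned)
g(buf[:-1])  # second call: same shape (1024,), but the input is 16-byte aligned
```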

**Alternatives/RFC**:

* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)


pytorch-bot bot commented Mar 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122158

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 9d0fef9 with merge base 4b53590:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

…aligned"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)

[ghstack-poisoned]
…aligned"


**Motivation**: #112771

**Summary**: Inductor generates triton that assumes that inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones the inputs. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This an can generate code that might be a bit slower, but this tradeoff can be worth it in some scenarios where you might otherwise make a lot of clones.

Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards.

**Tests** #122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing.

**Alternatives/RFC**:
* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)

[ghstack-poisoned]
davidberard98 added a commit that referenced this pull request Mar 22, 2024
@davidberard98 marked this pull request as ready for review March 22, 2024 14:30
@davidberard98 added the `topic: not user facing` label Mar 22, 2024
@ezyang (Contributor) left a comment:

Very nice, we should figure out a policy but this is a great start

@pytorch-bot added the `ciflow/trunk` label Mar 22, 2024
@eellison (Contributor) left a comment:

Nice! We should be able to assume alignment on parameters / static_inputs. If folks are going to be testing perf on this internally, it might be worth doing?


@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@davidberard98 (Contributor, Author) commented:

@eellison in the internal target this is apparently showing sufficiently good QPS. I think it’s reasonable to leave this as is until we have a use case where we expect the additional considerations to be necessary for QPS?

@davidberard98 (Contributor, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

```python
# assume_aligned_inputs means that we assume that inputs will be aligned; we generate
# code using this assumption, and clone tensors before use if they aren't aligned.
# In the common case, most inputs will be aligned.
assume_aligned_inputs: bool = True
```
A contributor commented:
I'm a little confused by the naming. If we are cloning unaligned tensors, aren't we forcing the tensors to be aligned, not assuming they are going to be aligned? Or is this an inductor vs rest of the system perspective thing?

@davidberard98 (Contributor, Author) replied:
Whoops, didn't see this comment until now.

I guess my thought process was that, when this flag is turned on, inductor generates triton code that assumes its inputs will be aligned.

But I can see why this is confusing - do you have a suggestion for a better name?

@davidberard98 (Contributor, Author) added:

Or, put another way: assume_aligned_inputs=True -> optimize for the case where inputs are aligned; assume_aligned_inputs=False -> optimize for the case where inputs may be unaligned.
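As an illustration (not part of the PR), alignment here is a property of the tensor's data pointer, which already includes the storage offset; a minimal sketch of the check:

```python
import torch

def is_16_byte_aligned(t: torch.Tensor) -> bool:
    # data_ptr() accounts for the storage offset, so a sliced view can be
    # unaligned even though the underlying allocation is aligned.
    return t.data_ptr() % 16 == 0

x = torch.randn(64)
print(is_16_byte_aligned(x))      # allocations are typically well-aligned -> True
print(is_16_byte_aligned(x[1:]))  # offset by one float32 (4 bytes) -> False
```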

pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
Pull Request resolved: #122158
Approved by: https://github.com/ezyang
@github-actions github-actions bot deleted the gh/davidberard98/279/head branch May 4, 2024 01:55