[dtensor][8/N] Introduce cost model for sharding #109145
This PR adds a basic communication cost model for sharding propagation.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109145
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 2192bdb with merge base 898482f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR adds some basic comm cost model for sharding prop
ghstack-source-id: 0efce2c591ff42d05e79a8fd03c365926c096cb4
Pull Request resolved: #109145
This PR adds a basic communication cost model for sharding propagation.

Why do we need this? Operators can generate multiple placement strategies. For example, matmul has at least 4 possible shardings:

1. `Shard(0), R`
2. `R, Shard(1)`
3. `Shard(1), Shard(0)`
4. `R, R`

We need to be able to choose among these options at runtime and reshard sensibly, which is why we are building a cost model for sharding here.

In this PR we associate each possible sharding strategy with redistribute costs. In eager mode, since ops run eagerly, we simply select the minimum-cost strategy. With global information available, the strategy selection could become more intelligent.
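To make the min-cost selection concrete, here is a minimal, framework-free sketch. It is not the code in this PR: the `redistribute_cost` formula, the `Strategy` class, and the string placements (`"R"`, `"S0"`, `"S1"`) are all hypothetical stand-ins chosen for illustration, and the cost numbers are relative units rather than measured latencies.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Placements are plain strings in this sketch (an assumption, not DTensor types):
# "R" = replicate, "S0" / "S1" = shard on tensor dim 0 / 1.

def redistribute_cost(src: str, dst: str, num_bytes: int, mesh_size: int) -> float:
    """Hypothetical relative comm cost of moving one tensor from src to dst."""
    if src == dst:
        return 0.0
    if src == "R":
        # Replicate -> Shard: every rank slices its local copy, no communication.
        return 0.0
    if dst == "R":
        # Shard -> Replicate: modeled as an all-gather of the full tensor.
        return num_bytes * (mesh_size - 1) / mesh_size
    # Shard(i) -> Shard(j): modeled as an all-to-all over the local shard.
    return (num_bytes / mesh_size) * (mesh_size - 1) / mesh_size

@dataclass
class Strategy:
    input_placements: Tuple[str, str]
    redistribute_costs: Tuple[float, float]

# The four matmul input shardings listed in the PR description.
MATMUL_INPUT_PLACEMENTS = [("S0", "R"), ("R", "S1"), ("S1", "S0"), ("R", "R")]

def matmul_strategies(
    current: Tuple[str, str], num_bytes: Tuple[int, int], mesh_size: int
) -> List[Strategy]:
    """Attach a per-input redistribute cost to every candidate strategy."""
    return [
        Strategy(
            input_placements=wanted,
            redistribute_costs=tuple(
                redistribute_cost(cur, want, size, mesh_size)
                for cur, want, size in zip(current, wanted, num_bytes)
            ),
        )
        for wanted in MATMUL_INPUT_PLACEMENTS
    ]

def select_min_cost(strategies: List[Strategy]) -> Strategy:
    """Eager-mode selection: pick the strategy with the lowest total cost."""
    return min(strategies, key=lambda s: sum(s.redistribute_costs))

# Inputs currently placed as (Shard(1), Replicate) on a 4-rank mesh.
best = select_min_cost(matmul_strategies(("S1", "R"), (1024, 1024), mesh_size=4))
print(best.input_placements, sum(best.redistribute_costs))
```

In this example the `Shard(1), Shard(0)` strategy wins because the first input already matches and the replicated second input can be sharded locally, so no communication is needed under the assumed cost function.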
Do we want to add at least one simple unit test for the cost modeling?
LGTM
This PR switches matrix ops to generate sharding strategies. With the cost selection algorithm introduced in the previous PR, we can enable this and more ops to leverage strategy-based sharding prop. This also fixes a bunch of corner cases that the existing propagation does not cover, resulting in full coverage for baddbmm.
Pull Request resolved: #110717
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145

As titled. This also handles something like [Shard(0), Shard(0)] correctly for pointwise ops, which previously errored out.
Pull Request resolved: #111234
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145, #110717

This adds `__str__` to op schema and dtensor spec for ease of reading.
Pull Request resolved: #111278
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145, #110717, #111234