[dtensor][8/N] Introduce cost model for sharding #109145

wanchaol · 2023-09-12T22:25:37Z

Stack from ghstack (oldest at bottom):

This PR adds some basic comm cost model for sharding prop

pytorch-bot · 2023-09-12T22:25:39Z

✅ No Failures

As of commit 2192bdb with merge base 898482f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

torch/distributed/_tensor/_collective_utils.py

This PR adds a basic communication cost model for sharding propagation.

Why do we need this? Operators can generate multiple placement strategies. For example, matmul has at least 4 possible shardings (R denotes a replicated placement):

1. Shard(0), R
2. R, Shard(1)
3. Shard(1), Shard(0)
4. R, R

We need to be able to choose among these options at runtime and reshard in a reasonable way, which is why we are building a cost model for sharding here. In this PR we associate each possible sharding strategy with redistribute costs. In eager mode, since ops run eagerly, we simply select the minimum-cost strategy. One can imagine that with some global information the strategy selection could become more intelligent.
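The min-cost selection described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the `Strategy` class, the `select_min_cost` function, and all cost numbers are hypothetical and chosen for the example, which assumes the two matmul inputs currently sit at (Shard(0), R).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Strategy:
    """One candidate sharding for an op's inputs (hypothetical container)."""

    input_placements: Sequence[str]
    # redistribute_costs[i] is an assumed cost of moving input i from its
    # current placement to input_placements[i]; 0.0 means no resharding.
    redistribute_costs: Sequence[float]


def select_min_cost(strategies: Sequence[Strategy]) -> Strategy:
    """Return the strategy whose summed redistribute cost is smallest."""
    if not strategies:
        raise ValueError("need at least one candidate strategy")
    return min(strategies, key=lambda s: sum(s.redistribute_costs))


# The four matmul candidates from the description, with made-up costs
# for inputs that are currently placed as (Shard(0), R).
candidates = [
    Strategy(("Shard(0)", "R"), (0.0, 0.0)),         # already matches
    Strategy(("R", "Shard(1)"), (4.0, 0.0)),         # gather the first input
    Strategy(("Shard(1)", "Shard(0)"), (3.0, 0.0)),  # reshard the first input
    Strategy(("R", "R"), (4.0, 0.0)),                # gather the first input
]

best = select_min_cost(candidates)
print(best.input_placements)
```

Because the selection only looks at the costs of the current op, it is a purely local, greedy choice, which matches the eager-mode behavior described above.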

torch/distributed/_tensor/placement_types.py

torch/distributed/_tensor/_collective_utils.py

fduwjj

Do we want to add at least one simple unit test (UT) for the cost modeling?

torch/distributed/_tensor/ops/tensor_ops.py

fduwjj

LGTM

This PR switches matrix ops to generate sharding strategies. With the cost selection algorithm introduced in the previous PR, we can enable this and let more ops leverage strategy-based sharding propagation. This also fixes a number of corner cases that the existing propagation does not cover, resulting in full coverage for baddbmm.
Pull Request resolved: #110717
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145

As titled, this also handles something like [Shard(0), Shard(0)] correctly for pointwise ops, which previously raised an error.
Pull Request resolved: #111234
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145, #110717
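For readers unfamiliar with the notation, a placement list such as [Shard(0), Shard(0)] has one entry per device mesh dimension, so here tensor dim 0 is split across both mesh dimensions. The sketch below only illustrates the resulting local shard shape; `local_shape` is a hypothetical helper written for this example (not a DTensor API), and it assumes the tensor divides evenly.

```python
from typing import Optional, Sequence, Tuple


def local_shape(
    global_shape: Sequence[int],
    mesh_shape: Sequence[int],
    shard_dims: Sequence[Optional[int]],
) -> Tuple[int, ...]:
    """Compute the per-device shape for one placement per mesh dimension.

    shard_dims[i] is the tensor dim sharded over mesh dim i, or None when
    the tensor is replicated over that mesh dim.
    """
    shape = list(global_shape)
    for mesh_size, dim in zip(mesh_shape, shard_dims):
        if dim is None:
            continue  # replicated on this mesh dim: shape unchanged
        if shape[dim] % mesh_size != 0:
            raise ValueError("this sketch assumes even sharding")
        shape[dim] //= mesh_size
    return tuple(shape)


# [Shard(0), Shard(0)] on a 2 x 4 mesh: dim 0 is split 2 * 4 = 8 ways.
print(local_shape((16, 8), (2, 4), (0, 0)))     # (2, 8)
# [Shard(0), Replicate] on the same mesh: dim 0 is split 2 ways only.
print(local_shape((16, 8), (2, 4), (0, None)))  # (8, 8)
```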

This adds `__str__` to the op schema and DTensor spec for ease of reading.
Pull Request resolved: #111278
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145, #110717, #111234

[dtensor][8/N] Introduce cost model for sharding

210493d

wanchaol requested review from mrshenli, zhaojuanmao, rohan-varma, H-Huang, awgu, kwen2501, fegin, fduwjj and kiukchung as code owners September 12, 2023 22:25

wanchaol requested review from d4l3k and wz337 as code owners September 12, 2023 22:25

wanchaol added a commit that referenced this pull request Sep 12, 2023

[dtensor][8/N] Introduce cost model for sharding

4c618a5

This PR adds some basic comm cost model for sharding prop ghstack-source-id: 0efce2c591ff42d05e79a8fd03c365926c096cb4 Pull Request resolved: #109145

Update on "[dtensor][8/N] Introduce cost model for sharding"

88ad346

wanchaol added a commit that referenced this pull request Sep 15, 2023

[dtensor][8/N] Introduce cost model for sharding

57bcca1

This PR adds some basic comm cost model for sharding prop ghstack-source-id: c45e652b428ee5ec1cb021975307e49dda6ffb7a Pull Request resolved: #109145

Update on "[dtensor][8/N] Introduce cost model for sharding"

31fc074

wanchaol added a commit that referenced this pull request Oct 2, 2023

[dtensor][8/N] Introduce cost model for sharding

3aec559

This PR adds some basic comm cost model for sharding prop ghstack-source-id: 1b99d3b8399e0944a2d849aedcd7b17d5bf0a70a Pull Request resolved: #109145

wanchaol requested review from yf225, Chillee and sanketpurandare October 2, 2023 21:23

Chillee reviewed Oct 2, 2023

torch/distributed/_tensor/_collective_utils.py (outdated, resolved)

sanketpurandare reviewed Oct 3, 2023

torch/distributed/_tensor/_collective_utils.py (resolved)

wanchaol added the release notes: distributed (dtensor) label Oct 6, 2023

wanchaol mentioned this pull request Oct 13, 2023

[dtensor][10/n] switch pointwise op to use op strategy #111234

Closed

fduwjj reviewed Oct 13, 2023

torch/distributed/_tensor/placement_types.py (resolved)

fduwjj reviewed Oct 13, 2023

torch/distributed/_tensor/_collective_utils.py (resolved)

fduwjj reviewed Oct 14, 2023

torch/distributed/_tensor/_collective_utils.py (outdated, resolved)

fduwjj reviewed Oct 14, 2023

Update on "[dtensor][8/N] Introduce cost model for sharding"

ce387e4

wanchaol mentioned this pull request Oct 14, 2023

[dtensor][11/n] adds some __str__ for ease of read #111278

Closed

wanchaol requested a review from fduwjj October 14, 2023 04:06

Update on "[dtensor][8/N] Introduce cost model for sharding"

6be8b98

fduwjj reviewed Oct 14, 2023

torch/distributed/_tensor/ops/tensor_ops.py (outdated, resolved)

fduwjj approved these changes Oct 14, 2023

Update on "[dtensor][8/N] Introduce cost model for sharding"

2192bdb

pytorchmergebot added the Merged label Oct 15, 2023

pytorchmergebot closed this in b4ab8ac Oct 15, 2023

facebook-github-bot deleted the gh/wanchaol/356/head branch October 19, 2023 14:24
