Conversation

@wanchaol (Collaborator) commented Jan 23, 2024

Stack from ghstack (oldest at bottom):

This PR adds support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default Partial placement
and overrides the Partial contracts to construct the mask and release
it after the reduction.

The MaskPartial placement has the potential to support other ops whose
sharding computation requires a mask for semantic correctness. For now
it lives in the embedding ops, but we can move it to a common place if
needed.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225
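
To make the mechanism concrete, here is a minimal, self-contained sketch of the pattern described above: a partial placement that builds a mask when the input indices are partitioned and applies/releases it when the partial results are reduced. The class layout and hook names below are illustrative stand-ins, not the actual DTensor Partial API from this PR, and the even-split assumption is only for brevity.

```python
# Illustrative sketch only: the base class and hook names are hypothetical
# stand-ins for the Partial placement contracts mentioned in the description.
from dataclasses import dataclass
from typing import Optional

import torch
import torch.distributed as dist


@dataclass
class PartialSketch:
    """Stand-in for the default partial placement (sum-reduced)."""
    reduce_op: str = "sum"

    def partition_value(self, tensor: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        return tensor

    def reduce_value(self, tensor: torch.Tensor) -> torch.Tensor:
        if dist.is_initialized():  # requires an initialized process group
            dist.all_reduce(tensor)
        return tensor


@dataclass
class MaskPartialSketch(PartialSketch):
    """Partial placement that masks out indices not owned by the local shard."""
    logical_dim_size: int = 0          # total rows of the unsharded embedding table
    mask: Optional[torch.Tensor] = None

    def partition_value(self, indices: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        # each rank owns a contiguous block of rows (assumes even divisibility)
        local_rows = self.logical_dim_size // world_size
        start = rank * local_rows
        # remember which indices fall outside the local shard ...
        self.mask = (indices < start) | (indices >= start + local_rows)
        # ... and remap them to a valid local row so the lookup does not fail
        local_indices = indices - start
        local_indices[self.mask] = 0
        return local_indices

    def reduce_value(self, partial_output: torch.Tensor) -> torch.Tensor:
        # zero the rows produced from masked (non-local) indices, sum-reduce
        # across ranks, then release the mask
        assert self.mask is not None, "partition_value must run before reduce_value"
        partial_output[self.mask] = 0.0
        self.mask = None
        if dist.is_initialized():
            dist.all_reduce(partial_output)
        return partial_output
```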

pytorch-bot bot commented Jan 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118080

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3748220 with merge base d59c2d6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions bot added the oncall: distributed and ciflow/inductor labels Jan 23, 2024
…skPartial"

This PR add support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement,
and override the Partial constracts to construct the mask and release
the mask after the reduction

The MaskPartial placement have the potential to support other ops
sharding computation that requires a mask for semantic correctness.
currently make it live in the embedding ops but we can move it to a
common place if needed

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
…skPartial"

This PR add support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement,
and override the Partial constracts to construct the mask and release
the mask after the reduction

The MaskPartial placement have the potential to support other ops
sharding computation that requires a mask for semantic correctness.
currently make it live in the embedding ops but we can move it to a
common place if needed

[ghstack-poisoned]
wanchaol added the ciflow/trunk and release notes: distributed (dtensor) labels Jan 23, 2024
…skPartial"

This PR add support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement,
and override the Partial constracts to construct the mask and release
the mask after the reduction

The MaskPartial placement have the potential to support other ops
sharding computation that requires a mask for semantic correctness.
currently make it live in the embedding ops but we can move it to a
common place if needed

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
…skPartial"

This PR add support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement,
and override the Partial constracts to construct the mask and release
the mask after the reduction

The MaskPartial placement have the potential to support other ops
sharding computation that requires a mask for semantic correctness.
currently make it live in the embedding ops but we can move it to a
common place if needed

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
wanchaol added a commit that referenced this pull request Jan 24, 2024
ghstack-source-id: df0a074
Pull Request resolved: #118080
@tianyu-l (Contributor) left a comment

LGTM!
This is an elegant way of implementing row-wise embedding in DTensor. The creative use of a buffer variable in _MaskPartial slightly violates the design principle of keeping Placement subclasses frozen (e.g. for caching). Nevertheless, this should be justified by the benefits it brings.

Comment on lines +133 to +134
if self.mask_buffer.data is not None or other.mask_buffer.data is not None:
return False

Some remarks on what we discussed offline:

  1. For a _MaskPartial produced in the output of sharding propagation (either output_spec or schema_suggestions in OutputSharding), the cached self.mask_buffer.data could be filled and not released (by reductions) yet still be returned on a cache hit. An extreme example is two parallel row-wise embeddings applied to the same (replicated) input. Such cases should be rare, and if they happen, MaskBuffer.materialize_mask would just throw an exception, which is OK (see the sketch after this list).
  2. self.mask_buffer.data is almost always not None as input to sharding propagation (because otherwise the _MaskPartial would probably have been reduced to Replicate or Shard), so this effectively forbids cache hits when _MaskPartial is an input, as noted in the follow-up PR [dtensor] add comment to clarify MaskPartial cache hit #118330.
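
A short sketch of the failure mode mentioned in remark 1, assuming a MaskBuffer-style holder as described in the review; the method names mirror the discussion, but the bodies are illustrative, not the source from this PR.

```python
# Illustrative sketch of the mask-buffer behavior discussed in remark 1;
# not the actual code from this PR.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class MaskBufferSketch:
    data: Optional[torch.Tensor] = None

    def materialize_mask(self, mask: torch.Tensor) -> None:
        # refusing to overwrite an existing mask makes the rare cache-hit
        # scenario above fail loudly rather than silently reuse a stale mask
        if self.data is not None:
            raise RuntimeError("mask buffer is already materialized")
        self.data = mask

    def release_mask(self) -> None:
        # normally called after the reduction, so an input _MaskPartial whose
        # buffer still holds data is a sign the placement was never reduced
        if self.data is None:
            raise RuntimeError("mask buffer is not materialized")
        self.data = None
```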

pytorchmergebot pushed a commit that referenced this pull request Jan 26, 2024
As titled, this PR enables rowwise embedding sharding in the
RowwiseParallel style and adds tests to ensure it works as expected.

Pull Request resolved: #118242
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079, #118080
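
For context on the #118242 commit above, here is a hedged usage sketch of how a row-wise sharded embedding is typically set up through the tensor-parallel API. The module paths and arguments (init_device_mesh, parallelize_module, RowwiseParallel, Replicate) follow the public API as generally documented, but the defaults and launch details are assumptions; treat this as illustrative and check the docs. Run under torchrun with one process per device.

```python
# Hedged usage sketch for rowwise-sharded embedding via the tensor-parallel
# API (related to #118242); launch details and defaults are assumptions.
import os

import torch
import torch.nn as nn
from torch.distributed._tensor import Replicate
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import RowwiseParallel, parallelize_module

# one-dimensional mesh over the torchrun world
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
mesh = init_device_mesh("cuda", (world_size,))

model = nn.Sequential(nn.Embedding(num_embeddings=50_000, embedding_dim=256)).cuda()

# shard the embedding table along its row (vocabulary) dimension; the token
# indices are replicated so every rank sees the full index tensor
model = parallelize_module(
    model,
    mesh,
    {"0": RowwiseParallel(input_layouts=Replicate())},
)

tokens = torch.randint(0, 50_000, (8, 128), device="cuda")
out = model(tokens)  # same shape as an unsharded lookup: (8, 128, 256)
```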
pytorchmergebot added a commit that referenced this pull request Jan 26, 2024
…artial (#118080)"

This reverts commit 8cc02b4.

Reverted #118080 on behalf of https://github.com/DanilBaibak due to breaking an internal build (see the comment on #118079).
@pytorchmergebot (Collaborator) commented

@wanchaol your PR has been successfully reverted.

pytorchmergebot pushed a commit that referenced this pull request Jan 26, 2024
As titled, this PR enables rowwise embedding sharding in the
RowwiseParallel style and adds tests to ensure it works as expected.

Pull Request resolved: #118242
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079, #118080
facebook-github-bot deleted the gh/wanchaol/430/head branch January 30, 2024 15:23
jeffdaily pushed a commit to ROCm/pytorch that referenced this pull request Feb 8, 2024
…ytorch#118080)

Pull Request resolved: pytorch#118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: pytorch#118079
jeffdaily pushed a commit to ROCm/pytorch that referenced this pull request Feb 8, 2024
…8242)

As titled, this PR enables rowwise embedding sharding in the
RowwiseParallel style and adds tests to ensure it works as expected.

Pull Request resolved: pytorch#118242
Approved by: https://github.com/tianyu-l
ghstack dependencies: pytorch#118079, pytorch#118080

Labels

ciflow/inductor, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (dtensor), Reverted
