Conversation

@kwen2501 (Contributor) commented Mar 29, 2024

Adding wildcard support for TP's `parallelize_module` API.

Example patterns:
`layers.*.linear`: any characters
`layers.?.linear`: single character
`layers.[1-2]`: digit range, matches `layers.1` and `layers.2`

Example use case:
A model has multiple layers, and we want to parallelize the linear module `lin` inside each layer.

```
model_tp = parallelize_module(
    model,
    device_mesh,
    {
        "layers.*.lin": ColwiseParallel(),
    },
)
```
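For intuition, here is a minimal sketch of how such matching could behave, assuming shell-style globbing is applied to each dot-separated segment of a module's fully qualified name via Python's `fnmatch`; the helper `match_fqn` and the segment-wise semantics are illustrative assumptions, not necessarily the actual implementation:

```
from fnmatch import fnmatchcase

def match_fqn(pattern: str, fqn: str) -> bool:
    # Hypothetical helper: match each dot-separated segment of a module
    # FQN against the corresponding segment of the pattern.
    p_atoms, f_atoms = pattern.split("."), fqn.split(".")
    if len(p_atoms) != len(f_atoms):
        return False
    return all(fnmatchcase(f, p) for p, f in zip(p_atoms, f_atoms))

assert match_fqn("layers.*.lin", "layers.0.lin")
assert match_fqn("layers.?.lin", "layers.3.lin")
assert not match_fqn("layers.?.lin", "layers.10.lin")  # '?' is one character
assert match_fqn("layers.[1-2]", "layers.2")
```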

Stack from ghstack (oldest at bottom):

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

[ghstack-poisoned]
@pytorch-bot bot commented Mar 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122968

Note: Links to docs will display an error until the docs builds have been completed.

❌ 40 New Failures, 3 Unrelated Failures

As of commit 40258ef with merge base 4dc09d6:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed label Mar 29, 2024
kwen2501 added a commit that referenced this pull request Mar 29, 2024
ghstack-source-id: a8fac4d
Pull Request resolved: #122968
@kwen2501 kwen2501 requested review from wanchaol and wz337 March 29, 2024 18:54
@kurman (Contributor) commented Apr 1, 2024

I wonder if a jq- or XPath-style spec would improve the UX, given the hierarchical nature? It comes with higher complexity, though.

@XilunWu (Contributor) commented Apr 1, 2024

Users need to be aware that the patterns in the dict param must be mutually exclusive; otherwise, applying parallelize_module repeatedly to the same submodule may cause issues. cc @wanchaol @fduwjj
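For example (a hypothetical plan; the module names are illustrative), both keys below match `layers.1.lin`, so that submodule would be hit by two parallel styles:

```
# Overlapping patterns: "layers.*.lin" and "layers.1.lin" both match
# the submodule "layers.1.lin".
plan = {
    "layers.*.lin": ColwiseParallel(),
    "layers.1.lin": RowwiseParallel(),
}
```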

@XilunWu (Contributor) left a comment

LGTM

@wz337 (Contributor) left a comment

LGTM! Could we add some tests to demonstrate the usage as well?

@kwen2501 (Contributor, Author) commented Apr 2, 2024

@wz337 Thanks for the review! The tests are in a stacked PR: #123101

@wanchaol (Collaborator) left a comment

Can you fold the tests PR into this PR? I think every new-feature PR should come with tests in the PR itself, not in a separate PR.

Adding wildcard support for TP's `parallelize_module` API.

Example patterns:
`layers.*.linear`: any characters
`layers.?.linear`: single character
`layers.[1-2]`: digit range, matches `layers.1` and `layers.2`

Example use case:
A model has multiple layers, and we want to parallelize the linear module `lin` inside each layer.
```
model_tp = parallelize_module(
    model,
    device_mesh,
    {
        "layers.*.lin": ColwiseParallel(),
    },
)
```




cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Apr 2, 2024
[TP] Add tests for wildcard support

ghstack-source-id: abb8e51
Pull Request resolved: #122968
@kwen2501 (Contributor, Author) commented Apr 2, 2024

@wanchaol Done!

@wanchaol (Collaborator) left a comment

looks great, thanks for addressing comments!

@@ -78,6 +78,17 @@ def reset_parameters(self):
        self.net2.reset_parameters()


class MLPStacked(nn.Module):
    def __init__(self, device):
@wanchaol (Collaborator) commented on this diff:
nit: I think it would be nice to have a num_layers arg (it can have a default) to control how many MLP layers this stacked MLP constructs
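A rough sketch of that suggestion (assuming the test file's MLP block is named MLPModule and takes a device argument; both names are guesses from context, not verified against the file):

```
class MLPStacked(nn.Module):
    def __init__(self, device, num_layers: int = 2):
        super().__init__()
        # Stack `num_layers` MLP blocks under "layers" so wildcard plans
        # like "layers.*.net?" can address every block uniformly.
        self.layers = nn.ModuleList(
            [MLPModule(device) for _ in range(num_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```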

    model_tp,
    device_mesh,
    {
        "layers.*.net?": ColwiseParallel(output_layouts=Replicate()),
@wanchaol (Collaborator) commented on this diff:
wondering, can we do an e2e col + row test here?

```
{
    "layers.*.net[1]": ColwiseParallel(),
    "layers.*.net[2]": RowwiseParallel()
}
```
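For context, Colwise followed by Rowwise is the standard two-linear TP pairing: the column-sharded output of the first linear feeds the row-sharded input of the second, so the pair needs only a single reduction at the end. Spelled out as a full call, a sketch reusing the names from the test above:

```
model_tp = parallelize_module(
    model_tp,
    device_mesh,
    {
        "layers.*.net[1]": ColwiseParallel(),
        "layers.*.net[2]": RowwiseParallel(),
    },
)
```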

@kwen2501 (Contributor, Author) commented Apr 2, 2024

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label Apr 2, 2024
@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@kwen2501 (Contributor, Author) commented Apr 2, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@kwen2501 (Contributor, Author) commented Apr 3, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x c3682288af1f7c66a3f685fcccba192299c38c3f returned non-zero exit code 1

```
The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:

    git commit --allow-empty

Otherwise, please use 'git cherry-pick --skip'
On branch main
Your branch is up to date with 'origin/main'.

You are currently cherry-picking commit c3682288af1.
  (all conflicts fixed: run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

nothing to commit, working tree clean
```

pytorchmergebot pushed a commit that referenced this pull request Apr 3, 2024
Improve tests per @wanchaol's suggestions in #122968

Pull Request resolved: #123199
Approved by: https://github.com/wanchaol
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Adding wildcard support for TP's `parallelize_module` API.

Example patterns:
`layers.*.linear`: any characters
`layers.?.linear`: single character
`layers.[1-2]`: digit range, matches `layers.1` and `layers.2`

Example use case:
A model has multiple layers, and we want to parallelize the linear module `lin` inside each layer.
```
model_tp = parallelize_module(
    model,
    device_mesh,
    {
        "layers.*.lin": ColwiseParallel(),
    },
)
```

Pull Request resolved: pytorch#122968
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/wanchaol
ghstack dependencies: pytorch#122919
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Improve tests per @wanchaol's suggestions in pytorch#122968

Pull Request resolved: pytorch#123199
Approved by: https://github.com/wanchaol
@atalman atalman closed this Apr 22, 2024
@atalman atalman reopened this Apr 22, 2024
@atalman (Contributor) commented Apr 22, 2024

This is already merged in 5027ef7, hence closing the PR.
