
Move DTensor from tau to PyTorch #576

Closed
14 of 16 tasks
wanchaol opened this issue Oct 24, 2022 · 0 comments
wanchaol commented Oct 24, 2022

We plan to move the DTensor implementation from pytorch/tau to pytorch/pytorch. DTensor has been developed under the pytorch/tau repo this half. Working in an out-of-tree repo has allowed us to move fast and prototype features with a very short turnaround time, but we want to move to core for the following reasons:

  • The Tensor Parallelism beta release in 2.0 will depend on DTensor, and we don’t want to do an out-of-tree release.
  • For the DTensor alpha (prototype) release in 2.0, releasing the feature directly in torch.distributed makes for a better end-user experience: users don’t need to install additional repos or worry about version mismatches.
  • We want to explore using DTensor together with other distributed features (e.g. distributed checkpointing, FSDP, etc.)

Detailed folder structure after the move

We will move the DTensor implementation from tau to PyTorch, concretely this involves:

  • tau/spmd/tensor -> torch/distributed/_tensor
    • tau/spmd/tensor/parallel -> torch/distributed/tensor_parallel in subsequent PRs.
  • tau/test/spmd/tensor -> test/distributed/_tensor
  • tau/spmd/testing/common_utils.py -> torch/testing/_internal/common_dtensor.py
    Note: we no longer need the lagging_op_db machinery after we move to PyTorch, so only common_utils.py is needed.
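The directory mapping above can be sketched as a small copy script. This is only an illustration: the repo-root prefixes are placeholders for local checkouts, and shutil stands in for the actual git mv / rsync steps.

```python
import shutil
from pathlib import Path

# Source -> destination mapping from the plan above.
# "tau/" and "pytorch/" are hypothetical local checkout roots.
MOVES = {
    "tau/spmd/tensor": "pytorch/torch/distributed/_tensor",
    "tau/test/spmd/tensor": "pytorch/test/distributed/_tensor",
    "tau/spmd/testing/common_utils.py": "pytorch/torch/testing/_internal/common_dtensor.py",
}

def apply_moves(moves: dict) -> None:
    """Copy each source directory/file to its new home, creating parents."""
    for src, dst in moves.items():
        src_path, dst_path = Path(src), Path(dst)
        dst_path.parent.mkdir(parents=True, exist_ok=True)
        if src_path.is_dir():
            shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
        else:
            shutil.copy2(src_path, dst_path)
```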

All of the imports in tau shall be preserved: we will leave an empty tau/spmd/tensor/ folder with an __init__.py file that still imports the DTensor public APIs, and keep this folder as an experimental area for the compiler stack to use for quick hacks such as op support and experimental features.
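The compatibility shim described here is just a package whose __init__.py re-exports from the new location (the real one would contain something like a star-import of torch.distributed._tensor). A minimal self-contained demo of the pattern, using throwaway package and function names in a temp directory:

```python
import sys
import tempfile
from pathlib import Path

# Build a stand-in "old" package whose __init__.py re-exports a "new" module,
# mirroring what tau/spmd/tensor/__init__.py would do for torch.distributed._tensor.
# "new_home", "old_home", and "distribute_tensor" are illustrative names only.
root = Path(tempfile.mkdtemp())
(root / "new_home").mkdir()
(root / "new_home" / "__init__.py").write_text(
    "def distribute_tensor():\n    return 'distributed'\n"
)
(root / "old_home").mkdir()
(root / "old_home" / "__init__.py").write_text(
    "from new_home import *  # re-export the public APIs\n"
)

sys.path.insert(0, str(root))
import old_home  # the old import path keeps working

print(old_home.distribute_tensor())  # -> 'distributed'
```

Callers importing from the old path are untouched, while the single source of truth lives at the new path.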

Move Logistics

Phase 1

The DTensor source of truth will remain pytorch/tau, so continue submitting PRs to this repo even after the initial migration PRs land.
Any new changes will be copied over to PyTorch by a script (rsync) until Phase 2; we will run the script one final time right before Phase 2 begins.
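The Phase-1 sync is essentially a one-way, newer-file-wins copy, which is what `rsync -au` does. A Python stand-in for that script might look like the sketch below (the function name and return value are choices for illustration, not the actual script):

```python
import shutil
from pathlib import Path

def sync_tree(src: Path, dst: Path) -> list:
    """One-way sync: copy files that are missing or newer in src, like `rsync -au`."""
    copied = []
    for f in src.rglob("*"):
        if f.is_dir():
            continue
        rel = f.relative_to(src)
        target = dst / rel
        # Copy when the target is absent or strictly older than the source.
        if not target.exists() or f.stat().st_mtime > target.stat().st_mtime:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # copy2 preserves mtime, so re-runs are no-ops
            copied.append(str(rel))
    return copied
```

Because mtimes are preserved, running the script repeatedly (as Phase 1 requires) only ever moves new work.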

After these initial PRs land, DTensor will be working in core. The initial PR will have testing disabled; we will then enable tests in smaller follow-up PRs.

Phase 2 (target date 11/25/2022)

At this switch over point, the source of truth will become PyTorch core.

  • All open DTensor PRs will be closed and their authors asked to resubmit them to pytorch/pytorch, unless the change is experimental.
  • We will delete the code, leaving an empty tau/spmd/tensor/ folder with an __init__.py file that still imports the DTensor public APIs, and keep this folder as an experimental area for the compiler stack to use for quick hacks such as op support and experimental features.

How you can help

  • Reduce the testing time of test_dtensor_ops.py. Specifically, we need to make the tests more efficient with fewer process or process-group initializations (e.g. using run_subtests or a multithreaded PG); this is critical if we want to enable the tests in core without excessive process/PG initialization.
  • Remove common size variable override for operator db tests here
  • Stay aligned with PyTorch’s linting so that the move is simply copying files over.
  • Write script for phase 1 to copy changes from tau to PyTorch for new DTensor PRs, see a similar script in [dtensor] changes required before moving to core #614
  • Land the initial migration PR (this blocks most other items below) @wanchaol
  • Enable all unit tests in core @wanchaol
  • Add enough debugging points so that the compiler stack could debug the issues without editing core code. (i.e. DEBUG_VERBOSE)
  • Move/rename the tensor/parallel folder to tensor_parallel and move it out of distributed/_tensor
  • Add docs entry for tensor_parallel folder
  • Add tags for dtensor module, tensor_parallel module
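On the test-time item above: the core idea is to pay for process-group initialization once per class and loop over operator cases inside a single test, rather than paying it once per test. A minimal sketch of the pattern with plain unittest (the dict stands in for a real process group, and subTest plays the role that run_subtests would in PyTorch's test harness):

```python
import unittest

class DTensorOpsStyleTest(unittest.TestCase):
    """Sketch: expensive setup (process-group init in the real tests) happens
    once per class, and many operator cases run under one test method."""

    @classmethod
    def setUpClass(cls):
        # Stand-in for dist.init_process_group(...), the costly step we want
        # to do only once instead of once per operator test.
        cls.pg = {"backend": "fake", "world_size": 4}

    def test_many_ops_one_pg(self):
        for op, expected in [("add", 3), ("mul", 2)]:
            with self.subTest(op=op):
                # Each case reuses self.pg; failures are reported per subTest.
                result = {"add": 1 + 2, "mul": 1 * 2}[op]
                self.assertEqual(result, expected)

    @classmethod
    def tearDownClass(cls):
        cls.pg = None  # stand-in for dist.destroy_process_group()
```

A multithreaded PG pushes the same idea further: ranks become threads in one process, so even the per-process startup cost disappears.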

To be discussed

Shall we continue using pytorch/tau for DTensor issue tracking together with the SPMD compiler stack?

@wanchaol wanchaol pinned this issue Oct 25, 2022
@wanchaol wanchaol added the SPMD label Oct 31, 2022
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Nov 19, 2022
This is part of TP Beta Release efforts.

ref: pytorch/PiPPy#576

Pull Request resolved: #89242
Approved by: https://github.com/wanchaol
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Nov 22, 2022
This is part of TP Beta Release efforts.
ref: pytorch/PiPPy#576
Pull Request resolved: #89467
Approved by: https://github.com/wanchaol
@wanchaol wanchaol unpinned this issue Feb 1, 2023