
Move DTensor from tau to PyTorch #576

Closed
14 of 16 tasks
wanchaol opened this issue Oct 24, 2022 · 0 comments
wanchaol commented Oct 24, 2022

We plan to move the DTensor implementation from pytorch/tau to pytorch/pytorch. DTensor has been developed under the pytorch/tau repo this half. Working in an out-of-tree repo has allowed us to move fast and prototype features with a very short turnaround time, but we want to move to core for the following reasons:

  • The Tensor Parallelism beta release in 2.0 will depend on DTensor, and we don’t want to do an out-of-tree release.
  • For the DTensor alpha (prototype) release in 2.0, releasing the feature directly in torch.distributed makes for a better end-user experience: users don’t need to install additional repos or worry about version mismatches.
  • We want to explore using DTensor together with other distributed features (e.g. distributed checkpointing, FSDP, etc.)

Detailed folder structure after the move

We will move the DTensor implementation from tau to PyTorch, concretely this involves:

  • tau/spmd/tensor -> torch/distributed/_tensor
    • tau/spmd/tensor/parallel -> torch/distributed/tensor_parallel in subsequent PRs.
  • tau/test/spmd/tensor -> test/distributed/_tensor
  • tau/spmd/testing/common_utils.py -> torch/testing/_internal/common_dtensor.py
    Note: we no longer need the lagging_op_db machinery after we move to PyTorch, so only common_utils.py is needed.
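The directory mapping above can be sketched as a small copy script. This is only an illustration: the repo-root prefixes are placeholders for local checkouts, and shutil stands in for the actual git mv / rsync steps.

```python
import shutil
from pathlib import Path

# Source -> destination mapping from the plan above.
# "tau/" and "pytorch/" are hypothetical local checkout roots.
MOVES = {
    "tau/spmd/tensor": "pytorch/torch/distributed/_tensor",
    "tau/test/spmd/tensor": "pytorch/test/distributed/_tensor",
    "tau/spmd/testing/common_utils.py": "pytorch/torch/testing/_internal/common_dtensor.py",
}

def apply_moves(moves: dict) -> None:
    """Copy each source directory/file to its new home, creating parents."""
    for src, dst in moves.items():
        src_path, dst_path = Path(src), Path(dst)
        dst_path.parent.mkdir(parents=True, exist_ok=True)
        if src_path.is_dir():
            shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
        else:
            shutil.copy2(src_path, dst_path)
```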

All of the imports in tau shall be preserved: we will leave an empty tau/spmd/tensor/ folder with an __init__.py file that still imports the DTensor public APIs, and keep this folder as an experimental area for the compiler stack to use for quick hacks such as op support and experimental features.
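The compatibility shim described here is just a package whose __init__.py re-exports from the new location (the real one would contain something like a star-import of torch.distributed._tensor). A minimal self-contained demo of the pattern, using throwaway package and function names in a temp directory:

```python
import sys
import tempfile
from pathlib import Path

# Build a stand-in "old" package whose __init__.py re-exports a "new" module,
# mirroring what tau/spmd/tensor/__init__.py would do for torch.distributed._tensor.
# "new_home", "old_home", and "distribute_tensor" are illustrative names only.
root = Path(tempfile.mkdtemp())
(root / "new_home").mkdir()
(root / "new_home" / "__init__.py").write_text(
    "def distribute_tensor():\n    return 'distributed'\n"
)
(root / "old_home").mkdir()
(root / "old_home" / "__init__.py").write_text(
    "from new_home import *  # re-export the public APIs\n"
)

sys.path.insert(0, str(root))
import old_home  # the old import path keeps working

print(old_home.distribute_tensor())  # -> 'distributed'
```

Callers importing from the old path are untouched, while the single source of truth lives at the new path.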

Move Logistics

Phase 1

The DTensor source of truth will remain pytorch/tau, so continue submitting PRs to this repo even after the initial migration PRs land.
Any new changes will be copied over to PyTorch by a script (rsync) until Phase 2; we will run the script one final time right before Phase 2 begins.
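The Phase-1 sync is essentially a one-way, newer-file-wins copy, which is what `rsync -au` does. A Python stand-in for that script might look like the sketch below (the function name and return value are choices for illustration, not the actual script):

```python
import shutil
from pathlib import Path

def sync_tree(src: Path, dst: Path) -> list:
    """One-way sync: copy files that are missing or newer in src, like `rsync -au`."""
    copied = []
    for f in src.rglob("*"):
        if f.is_dir():
            continue
        rel = f.relative_to(src)
        target = dst / rel
        # Copy when the target is absent or strictly older than the source.
        if not target.exists() or f.stat().st_mtime > target.stat().st_mtime:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # copy2 preserves mtime, so re-runs are no-ops
            copied.append(str(rel))
    return copied
```

Because mtimes are preserved, running the script repeatedly (as Phase 1 requires) only ever moves new work.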

After these initial PRs land, DTensor will be working in core. The initial PR will have testing disabled; we will then enable tests in smaller follow-up PRs.

Phase 2 (target date 11/25/2022)

At this switch over point, the source of truth will become PyTorch core.

  • All open DTensor PRs will be closed and their authors asked to resubmit them to pytorch/pytorch, unless the change is experimental.
  • We will delete the code, leaving an empty tau/spmd/tensor/ folder with an __init__.py file that still imports the DTensor public APIs, and keep this folder as an experimental area for the compiler stack to use for quick hacks such as op support and experimental features.

How you can help

  • Reduce the testing time of test_dtensor_ops.py. Specifically, we need to make the tests more efficient with fewer process or process-group initializations (e.g. using run_subtests or a multithreaded PG); this is critical if we want to enable the tests in core without excessive process/PG initialization.
  • Remove common size variable override for operator db tests here
  • Stay aligned with PyTorch’s linting so that the move is simply copying files over.
  • Write script for phase 1 to copy changes from tau to PyTorch for new DTensor PRs, see a similar script in [dtensor] changes required before moving to core #614
  • Land the initial migration PR (this blocks most other items below) @wanchaol
  • Enable all unit tests in core @wanchaol
  • Add enough debugging points so that the compiler stack could debug the issues without editing core code. (i.e. DEBUG_VERBOSE)
  • Move/rename the tensor/parallel folder to tensor_parallel and move it out of distributed/_tensor
  • Add docs entry for tensor_parallel folder
  • Add tags for dtensor module, tensor_parallel module
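On the test-time item above: the core idea is to pay for process-group initialization once per class and loop over operator cases inside a single test, rather than paying it once per test. A minimal sketch of the pattern with plain unittest (the dict stands in for a real process group, and subTest plays the role that run_subtests would in PyTorch's test harness):

```python
import unittest

class DTensorOpsStyleTest(unittest.TestCase):
    """Sketch: expensive setup (process-group init in the real tests) happens
    once per class, and many operator cases run under one test method."""

    @classmethod
    def setUpClass(cls):
        # Stand-in for dist.init_process_group(...), the costly step we want
        # to do only once instead of once per operator test.
        cls.pg = {"backend": "fake", "world_size": 4}

    def test_many_ops_one_pg(self):
        for op, expected in [("add", 3), ("mul", 2)]:
            with self.subTest(op=op):
                # Each case reuses self.pg; failures are reported per subTest.
                result = {"add": 1 + 2, "mul": 1 * 2}[op]
                self.assertEqual(result, expected)

    @classmethod
    def tearDownClass(cls):
        cls.pg = None  # stand-in for dist.destroy_process_group()
```

A multithreaded PG pushes the same idea further: ranks become threads in one process, so even the per-process startup cost disappears.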

To be discussed

Shall we continue using pytorch/tau for DTensor issue tracking together with the SPMD compiler stack?

@wanchaol wanchaol pinned this issue Oct 25, 2022
@wanchaol wanchaol added the SPMD label Oct 31, 2022
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Nov 19, 2022
This is part of TP Beta Release efforts.

ref: pytorch/PiPPy#576

Pull Request resolved: #89242
Approved by: https://github.com/wanchaol
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Nov 22, 2022
This is part of TP Beta Release efforts.
ref: pytorch/PiPPy#576
Pull Request resolved: #89467
Approved by: https://github.com/wanchaol
@wanchaol wanchaol unpinned this issue Feb 1, 2023