[PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests #87987

Closed
wants to merge 21 commits into from

wz337
Contributor

@wz337 wz337 commented Oct 28, 2022

This PR includes:

Changes from @kumpera (#86327): add a multi-threaded FileSystemWriter for distributed checkpointing by introducing two knobs on FileSystemWriter: thread_count and per_thread_copy_ahead. This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
Add parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
Modify @with_comms in ShardedTensorTestBase to accept *args and **kwargs.
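
To illustrate why a test decorator such as @with_comms needs to forward *args and **kwargs once tests are parametrized, here is a minimal sketch. It is not the actual PyTorch decorator: `with_comms`, `FakeTestBase`, `init_comms`, and `destroy_comms` below are simplified, hypothetical stand-ins written only for this example.

```python
import functools


def with_comms(func):
    """Hypothetical, simplified stand-in for a comms-setup test decorator."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # Forwarding *args/**kwargs lets a parametrization layer pass
        # extra arguments (e.g. thread_count) through to the test body.
        self.init_comms()
        try:
            return func(self, *args, **kwargs)
        finally:
            # Always tear down, even if the test body raises.
            self.destroy_comms()

    return wrapper


class FakeTestBase:
    """Hypothetical test base that only records what happened."""

    def __init__(self):
        self.events = []

    def init_comms(self):
        self.events.append("init")

    def destroy_comms(self):
        self.events.append("destroy")

    @with_comms
    def test_save(self, thread_count=1):
        self.events.append(f"run thread_count={thread_count}")
        return thread_count


case = FakeTestBase()
case.test_save(thread_count=2)
print(case.events)
```

Without the forwarding in `wrapper`, a call that supplies `thread_count` would fail with a `TypeError` before the test body runs.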
Tests:

python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py

test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but times out on CI. We will switch to a thread-based process group (PG) and update this test in a follow-up PR.

[T134844615]

Docstrings and updated comments will be added in follow-up PRs.
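
As a rough illustration of the two knobs named in this description, the sketch below writes a set of buffers with a configurable number of worker threads. It uses only the Python standard library and is not the PR's implementation. `write_items` is a hypothetical helper, and treating per_thread_copy_ahead as a per-thread byte budget of staged data is an assumption made for this example.

```python
import os
import queue
import tempfile
import threading


def write_items(directory, items, thread_count=1, per_thread_copy_ahead=1_000_000):
    """Write (name, data) pairs to `directory` using `thread_count` threads.

    Assumption for this sketch: `per_thread_copy_ahead` is a byte budget.
    Each worker stages copies of buffers until the budget is reached,
    then flushes the staged buffers to disk.
    """
    work = queue.Queue()
    for item in items:
        work.put(item)

    def worker():
        staged = []
        staged_bytes = 0

        def flush():
            nonlocal staged, staged_bytes
            for name, data in staged:
                with open(os.path.join(directory, name), "wb") as f:
                    f.write(data)
            staged = []
            staged_bytes = 0

        while True:
            try:
                name, data = work.get_nowait()
            except queue.Empty:
                break
            staged.append((name, bytes(data)))  # stage a copy of the buffer
            staged_bytes += len(data)
            if staged_bytes >= per_thread_copy_ahead:
                flush()
        flush()  # write whatever is still staged

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


out_dir = tempfile.mkdtemp()
shards = [(f"shard_{i}.bin", bytes([i]) * 100) for i in range(8)]
write_items(out_dir, shards, thread_count=4, per_thread_copy_ahead=250)
print(sorted(os.listdir(out_dir)))
```

The sketch omits error propagation from worker threads, which a real writer would need to handle.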

cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot bot commented Oct 28, 2022


✅ No Failures

As of commit ac5583a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (sharded) release notes category label Oct 28, 2022
@wz337 wz337 requested a review from fduwjj October 28, 2022 15:25
@wz337 wz337 marked this pull request as draft November 4, 2022 01:23
@wz337 wz337 changed the title [PT-D][Checkpointing]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests [PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests Nov 16, 2022
@github-actions github-actions bot added ciflow/inductor module: amp (automated mixed precision) autocast module: cpu CPU specific problem (e.g., perf, algorithm) module: dynamo module: inductor module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration NNC oncall: quantization Quantization support in PyTorch labels Nov 16, 2022
@pytorch-bot pytorch-bot bot added the ciflow/mps Run MPS tests (subset of trunk) label Nov 16, 2022
@wz337 wz337 marked this pull request as ready for review November 17, 2022 21:54
@wz337 wz337 removed oncall: quantization Quantization support in PyTorch module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration module: amp (automated mixed precision) autocast NNC ciflow/mps Run MPS tests (subset of trunk) module: inductor module: dynamo ciflow/inductor labels Nov 17, 2022
@pytorchmergebot
Collaborator

Successfully rebased fast_writer onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fast_writer && git pull --rebase)

@wz337
Contributor Author

wz337 commented Nov 30, 2022

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
…checkpointing and Update tests (pytorch#87987)

Pull Request resolved: pytorch#87987
Approved by: https://github.com/fduwjj
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged module: cpu CPU specific problem (e.g., perf, algorithm) release notes: distributed (sharded) release notes category

3 participants