Skip to content

Conversation

ankitageorge
Copy link
Contributor

@ankitageorge ankitageorge commented May 30, 2025

Stack from ghstack (oldest at bottom):

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: D75536985

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

[ghstack-poisoned]
Copy link

pytorch-bot bot commented May 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154743

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit fad5fa1 with merge base f79689b (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (checkpoint) labels May 30, 2025
ankitageorge added a commit that referenced this pull request May 30, 2025
Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

ghstack-source-id: 287228528
Pull Request resolved: #154743
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
ankitageorge added a commit that referenced this pull request May 30, 2025
Pull Request resolved: #154743

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
ghstack-source-id: 287291639
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
ankitageorge added a commit that referenced this pull request Jun 13, 2025
Pull Request resolved: #154743

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.
ghstack-source-id: 290286966

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
ankitageorge added a commit that referenced this pull request Jun 13, 2025
Pull Request resolved: #154743

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.
ghstack-source-id: 290311030

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
ankitageorge added a commit that referenced this pull request Jun 23, 2025
Pull Request resolved: #154743

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.
ghstack-source-id: 292189152

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
ankitageorge added a commit that referenced this pull request Jun 23, 2025
Pull Request resolved: #154743

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.
ghstack-source-id: 292214509

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
ankitageorge added a commit that referenced this pull request Jun 24, 2025
Pull Request resolved: #154743

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.
ghstack-source-id: 292257534

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

num_threads: Number of threads to use for parallel processing of saving data to output files.
"""
# Create filesystem using fsspec for file operations
input_fs, _ = url_to_fs(input_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as well. Lets add some timing logs to keep track of the performance of this script.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 30, 2025
Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
ankitageorge added a commit that referenced this pull request Jun 30, 2025
Pull Request resolved: #154743

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.
ghstack-source-id: 293486705

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
ankitageorge added a commit that referenced this pull request Jun 30, 2025
Pull Request resolved: #154743

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.
ghstack-source-id: 293495694

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

@ankitageorge
Copy link
Contributor Author

@pytorchmergebot merge -i

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / verify-cachebench-cpu-build / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Jul 1, 2025
…h step (#156705)

Title - we can consolidate the shards to a full tensors, optionally behind a flag, in the finish step of DCP.save
also adds the thread count argument which is configurable for users, before we were just using the default of 1.
Re-creating #155940 bc it got into a bad detached state

Differential Revision: [D77231774](https://our.internmc.facebook.com/intern/diff/D77231774/)

Pull Request resolved: #156705
Approved by: https://github.com/saumishr
ghstack dependencies: #154743
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D75536985

pytorchmergebot pushed a commit that referenced this pull request Jul 2, 2025
Need to change an argument name that was changed in the test so that it doesn't throw

Differential Revision: [D77604210](https://our.internmc.facebook.com/intern/diff/D77604210/)

Pull Request resolved: #157386
Approved by: https://github.com/meetv18
ghstack dependencies: #154743, #156705
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (checkpoint)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants