
Conversation

awgu (Collaborator) commented Dec 10, 2022

Stack from ghstack:

Every sharded strategy always allocates a (padded) sharded gradient as the target tensor for the reduce-scatter. This PR moves that allocation to the default stream instead of the post-backward stream.

Minor: This PR changes `state.device` to `handle.device`, which has no semantic difference; it is simply preferable to push as much logic as possible down to the `handle` when appropriate.
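
For context on why the allocating stream matters: PyTorch's CUDA caching allocator associates each block with the stream that is current when the block is allocated, so a target tensor allocated in a side stream is not freely reusable by default-stream allocations. Below is a minimal sketch of the idea, assuming hypothetical names (`post_backward_stream`, `alloc_padded_sharded_grad`, `reduce_scatter_grad`) rather than the actual FSDP internals:

```python
import torch
import torch.distributed as dist

# Hypothetical side stream standing in for FSDP's post-backward stream.
post_backward_stream = torch.cuda.Stream()

def alloc_padded_sharded_grad(device: torch.device, padded_numel: int) -> torch.Tensor:
    # Allocate in the current (default) stream: the caching allocator ties
    # each block to the stream that was current at allocation time, so a
    # block allocated here can later be freed and reused by default-stream
    # work without cross-stream synchronization.
    return torch.empty(padded_numel, device=device)

def reduce_scatter_grad(unsharded_grad: torch.Tensor,
                        padded_sharded_grad: torch.Tensor) -> None:
    # Only the communication runs in the post-backward stream; the target
    # tensor was already allocated in the default stream above.
    post_backward_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(post_backward_stream):
        dist.reduce_scatter_tensor(padded_sharded_grad, unsharded_grad)
    # A full implementation must also manage tensor lifetimes across
    # streams (e.g. Tensor.record_stream); omitted in this sketch.
```

Under that assumption, allocating the reduce-scatter target in the default stream keeps the block reusable by ordinary default-stream work instead of being cached for the post-backward stream.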

pytorch-bot bot commented Dec 10, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90617

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failure, 10 Pending

As of commit 6007ced:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

awgu (Collaborator, Author) commented Dec 11, 2022

This PR was accidentally absorbed into the previous PR while resolving rebase conflicts.

awgu closed this Dec 11, 2022
facebook-github-bot deleted the gh/awgu/255/head branch June 8, 2023 15:31