
Disable distributed per-layer clipping with hooks grad sample mode #747

Open
wants to merge 2 commits into main
Conversation

iden-kalemaj
Contributor

Summary:
We disable support for distributed per-layer clipping with "hooks" grad sample mode, since it raises an error when using `register_full_backward_hook`. Distributed per-layer clipping with "ew" grad sample mode can still be used.
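For reference, the still-supported path looks roughly like the sketch below. This is a hypothetical usage example, not code from this PR: the single-process "gloo" group, the toy model and data, and the exact `make_private` keywords (`clipping`, `grad_sample_mode`, a per-layer list for `max_grad_norm`) are assumptions based on the public Opacus API and may differ in detail.

```python
import os

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset

from opacus import PrivacyEngine
from opacus.distributed import DifferentiallyPrivateDistributedDataParallel as DPDDP

# Single-process process group only so the sketch is self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DPDDP(torch.nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data_loader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=8
)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,
    # Per-layer clipping takes one max norm per parameter.
    max_grad_norm=[1.0] * sum(1 for _ in model.parameters()),
    clipping="per_layer",
    grad_sample_mode="ew",  # "hooks" + distributed per-layer clipping is the disabled case
)

dist.destroy_process_group()
```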

The issue arises because `DistributedPerLayerOptimizer` registers per-parameter hooks on top of the per-module hooks. During the backward pass, the per-parameter hooks fire before the per-module hooks. Per-sample gradients are computed when the per-module hooks fire, so an error occurs when the per-parameter hooks try to access per-sample gradients that have not been computed yet. PyTorch does not provide a way to force the order in which hooks are called.
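The ordering can be observed in plain PyTorch, independent of Opacus. The sketch below (illustrative names, not code from this PR) registers both kinds of hooks on one `nn.Linear` and records the order in which they fire during backward; in the scenario described above, the per-parameter hook is reported before the per-module hook.

```python
import torch
import torch.nn as nn

events = []
layer = nn.Linear(4, 2)

# Per-parameter hook: fires when the gradient for this tensor is produced.
layer.weight.register_hook(lambda grad: events.append("per-parameter hook (weight)"))

# Per-module hook: fires once gradients w.r.t. the module inputs are ready.
layer.register_full_backward_hook(
    lambda module, grad_input, grad_output: events.append("per-module full backward hook")
)

x = torch.randn(3, 4, requires_grad=True)
layer(x).sum().backward()

print(events)  # shows the firing order of the two hook types
```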

In the future, we may consider an approach that does not use per-parameter hooks. Given the limited use of per-layer clipping, we disable the faulty case of distributed data parallel with "hooks".
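To surface the failure at configuration time rather than deep in the backward pass, the disabling could take the form of a guard like the one below. This is purely illustrative: the function name, call site, and error message are hypothetical and not the actual Opacus implementation.

```python
# Hypothetical guard; not the actual Opacus code.
def _check_distributed_per_layer_support(
    distributed: bool, clipping: str, grad_sample_mode: str
) -> None:
    if distributed and clipping == "per_layer" and grad_sample_mode == "hooks":
        raise ValueError(
            "Distributed per-layer clipping is not supported with "
            "grad_sample_mode='hooks'; use grad_sample_mode='ew' instead."
        )
```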

Differential Revision: D71706681

…rch#720)

Summary:

`register_backward_hook` is deprecated and may lead to errors in gradient calculation. We switch to the supported `register_full_backward_hook`.

Differential Revision: D68562558
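For context on the switch described in this commit, both APIs take the same hook signature; only the registration call changes. A minimal sketch (not code from this PR):

```python
import torch
import torch.nn as nn

def log_grads(module, grad_input, grad_output):
    # Same hook signature for both registration APIs.
    print(type(module).__name__, [g.shape for g in grad_output if g is not None])

layer = nn.Linear(4, 2)

# Deprecated; can report incorrect grad_input on modules that perform multiple operations:
# layer.register_backward_hook(log_grads)

# Supported replacement:
layer.register_full_backward_hook(log_grads)

x = torch.randn(3, 4, requires_grad=True)
layer(x).sum().backward()
```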
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D71706681
