Disable distributed per-layer clipping with hooks grad sample mode #747
Summary:
We disable support for distributed per-layer clipping with the "hooks" grad sample mode, since it raises an error when register_full_backward_hook is used. Distributed per-layer clipping with the "ew" grad sample mode can still be used.

The issue arises because DistributedPerLayerOptimizer registers per-parameter hooks on top of the per-module hooks. During the backward pass, the per-parameter hooks fire before the per-module hooks. Per-sample gradients are computed when the per-module hooks fire, so an error occurs when the per-parameter hooks try to access per-sample gradients that have not been computed yet. PyTorch provides no way to force the order in which these hooks are called.
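A minimal sketch in plain PyTorch (not Opacus internals) that illustrates the ordering problem described above; the hook bodies and the recorded event list are illustrative only:

```python
import torch
import torch.nn as nn

events = []
layer = nn.Linear(4, 2)

def module_hook(module, grad_input, grad_output):
    # In "hooks" grad sample mode, per-sample gradients are computed
    # when the per-module full backward hook fires.
    events.append("module full backward hook")

def param_hook(grad):
    # A per-parameter hook in the style of DistributedPerLayerOptimizer;
    # it would expect per-sample gradients that the module hook has not
    # produced yet at this point.
    events.append("weight parameter hook")

layer.register_full_backward_hook(module_hook)
layer.weight.register_hook(param_hook)

x = torch.randn(3, 4, requires_grad=True)
layer(x).sum().backward()

# In the problematic case the parameter hook fires first, e.g.
# ['weight parameter hook', 'module full backward hook']
print(events)
```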
In the future, we may consider an approach that does not rely on per-parameter hooks. Given the limited use of per-layer clipping, we disable the faulty combination of distributed data parallel training with the "hooks" grad sample mode.
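For reference, a hedged sketch of the still-supported path, distributed per-layer clipping with grad_sample_mode="ew". The model wrapper (torch DDP), the gloo backend, and the exact keyword combination are assumptions for illustration, not a definitive recipe from this PR:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

def main():
    dist.init_process_group(backend="gloo")  # assumes launch via torchrun
    model = DDP(nn.Linear(16, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    data_loader = DataLoader(
        TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))),
        batch_size=8,
    )

    privacy_engine = PrivacyEngine()
    # Per-layer clipping takes one clipping threshold per trainable parameter.
    per_layer_norms = [1.0 for _ in model.parameters()]
    model, optimizer, data_loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=data_loader,
        noise_multiplier=1.0,
        max_grad_norm=per_layer_norms,
        clipping="per_layer",
        grad_sample_mode="ew",  # "hooks" with DDP + per-layer clipping is now rejected
    )

if __name__ == "__main__":
    main()
```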
Differential Revision: D71706681