Functorch gradients: investigation and fix #510
Conversation
@ffuuugor has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks a lot for implementing this solution and doing a proper benchmark. I like the latter option: I think it's conceptually more elegant and more performant. Also, the code for iterate_submodules is short and pretty self-explanatory, so it matches the conceptual elegance of the idea!
```
if has_trainable_params(module):
    yield module

# ...

# we'll apply functorch for the entire subtree
```
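For context, a plausible reconstruction of the full generator based on this fragment; has_trainable_params mirrors the fragment, while requires_functorch is a hypothetical name for the "module is handled by functorch" check, so this is a sketch rather than the actual Opacus code:

```python
import torch.nn as nn


def has_trainable_params(module: nn.Module) -> bool:
    # Trainable parameters attached directly to this module (non-recursive)
    return any(p.requires_grad for p in module.parameters(recurse=False))


def requires_functorch(module: nn.Module) -> bool:
    # Hypothetical predicate: the module mixes direct trainable parameters
    # with trainable parameters living in its submodules
    direct = has_trainable_params(module)
    nested = any(
        p.requires_grad
        for child in module.children()
        for p in child.parameters()
    )
    return direct and nested


def iterate_submodules(module: nn.Module):
    if has_trainable_params(module):
        yield module

    # we'll apply functorch for the entire subtree:
    # don't recurse, or the children would also get normal hooks
    if requires_functorch(module):
        return

    for child in module.children():
        yield from iterate_submodules(child)
```

Under this reading, yielded modules are the ones that receive grad sample hooks; a subtree rooted at a functorch-handled module is yielded exactly once, so no parameter ends up with two grad_sample producers.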
Nit: can we replace it with
`# Don't recurse if module is handled by functorch`
@ffuuugor has updated the pull request. You must reimport the pull request before landing.

@ffuuugor has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary:

*The investigation part of this PR was done by @alexandresablayrolles, thanks for figuring out why the tests were failing.*

## Background

The current implementation of functorch-based per-sample gradients fails on modules that have both trainable non-recursive parameters and standard submodules, e.g.:

```
class LinearWithExtraParam(nn.Module):
    def __init__(self, in_features: int, out_features: int, hidden_dim: int = 8):
        super().__init__()
        self.fc = nn.Linear(in_features, hidden_dim)
        self.extra_param = nn.Parameter(torch.randn(hidden_dim, out_features))

    def forward(self, x):
        x = self.fc(x)
        x = x.matmul(self.extra_param)
        return x
```

The reason is that the functorch hook computes gradients for recursive submodules too, while normal hooks are also attached to those same submodules. GradSampleModule then sees two grad_sample tensors, assumes it needs to accumulate, and adds them together.

## Solution(s)

There are essentially two ways to fix this: either make functorch compute per-sample gradients for non-recursive parameters only, or don't attach normal hooks to submodules whose parent module is handled by functorch. This diff implements the latter option (reasoning below); for demo purposes, the former option can be seen in #531.

From a pure code perspective, the former option (let's call it "non-recursive functorch") is more appealing to me. It better fits the existing paradigm and matches the behaviour of normal hooks: all of the existing code deals only with a module's immediate non-recursive parameters. However, it doesn't make much sense from an efficiency perspective: "non-recursive functorch" would do all the work to compute per-sample gradients for its submodules, only for those gradients to be filtered out at the very last stage.
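Before comparing the two options' performance, it helps to see why functorch applied at a parent module inevitably touches submodule parameters. Below is a minimal, self-contained sketch of the standard functorch per-sample-gradient recipe applied to the example module from the Background section; it is not Opacus code, and the loss and shapes are illustrative. Note that the computed gradients already cover the submodule's parameters (fc.weight, fc.bias), which is exactly what collides with the normal hooks attached to fc:

```python
import torch
import torch.nn as nn
from functorch import make_functional, vmap, grad


# Same module as in the Background section, repeated so the snippet runs standalone
class LinearWithExtraParam(nn.Module):
    def __init__(self, in_features: int, out_features: int, hidden_dim: int = 8):
        super().__init__()
        self.fc = nn.Linear(in_features, hidden_dim)
        self.extra_param = nn.Parameter(torch.randn(hidden_dim, out_features))

    def forward(self, x):
        return self.fc(x).matmul(self.extra_param)


model = LinearWithExtraParam(in_features=4, out_features=2)
fmodel, params = make_functional(model)  # stateless model + tuple of parameters


def compute_loss(params, sample, target):
    # Treat a single sample as a batch of one
    prediction = fmodel(params, sample.unsqueeze(0))
    return nn.functional.mse_loss(prediction, target.unsqueeze(0))


x = torch.randn(16, 4)  # batch of 16 samples
y = torch.randn(16, 2)

# vmap over the batch gives one gradient per sample for EVERY parameter
# in the subtree: fc.weight and fc.bias are included alongside extra_param
per_sample_grads = vmap(grad(compute_loss), in_dims=(None, 0, 0))(params, x, y)

for p, g in zip(params, per_sample_grads):
    print(tuple(p.shape), "->", tuple(g.shape))  # leading dim of 16 = batch size
```

If a normal hook also writes a grad_sample for fc.weight and fc.bias, GradSampleModule ends up with two per-sample gradients for the same parameters and sums them, which is the double counting described above.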
The alternative option (a.k.a. "functorch for subtrees") does involve a bit more convoluted logic, but the choice between the two has a noticeable effect on performance. Below are the results of MNIST benchmarks with different configurations; I tested multiple configurations because, at the end of the day, the performance impact depends on how deep the functorch-handled subtrees are:

* Standard model: our model from the MNIST example, standard layers only (2 conv + 2 linear). No overhead expected; functorch doesn't kick in.
* Mid-level model: the leaf nodes (two linear layers) have one extra param each and are computed with functorch. Overhead: 2x Linear hook.
* Extreme model: the root module has one extra param and needs to be handled by functorch. Overhead: 2x Linear hook + 2x Conv hook.

| Mode | non-recursive functorch | functorch for subtrees |
|:-----------------------:|:------------------------:|:-----------------------:|
| Standard model (CPU) | 138s | 136s |
| Standard model (GPU) | 149s | 150s |
| Mid-level model (CPU) | 157s | 150s |
| Mid-level model (GPU) | 100s | 97s |
| Extreme model (CPU) | 207s | 172s |
| Extreme model (GPU) | 101s | 94s |

Pull Request resolved: pytorch#510
Reviewed By: alexandresablayrolles
Differential Revision: D39579487
Pulled By: ffuuugor
fbshipit-source-id: 1b089bd04ab110174a1f2ebb371380eb2ce76054