alternative functorch fix #531

Closed
wants to merge 1 commit into from
Conversation

ffuuugor
Contributor
Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Docs change / refactoring / dependency upgrade

Motivation and Context / Related issue

How Has This Been Tested (if it applies)

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 26, 2022
facebook-github-bot pushed a commit that referenced this pull request Oct 28, 2022
Summary:
*The investigation for this PR was done by alexandresablayrolles, thanks for figuring out why the tests were failing.*

## Background
The current implementation of functorch-based per-sample gradients fails on modules that have both trainable non-recursive parameters and standard submodules, e.g. the module below:
```
import torch
from torch import nn

class LinearWithExtraParam(nn.Module):
    def __init__(self, in_features: int, out_features: int, hidden_dim: int = 8):
        super().__init__()
        # standard submodule with its own parameters
        self.fc = nn.Linear(in_features, hidden_dim)
        # trainable parameter registered directly on this module (non-recursive)
        self.extra_param = nn.Parameter(torch.randn(hidden_dim, out_features))

    def forward(self, x):
        x = self.fc(x)
        x = x.matmul(self.extra_param)
        return x
```

The reason is that the functorch hook actually computes gradients for recursive submodules too. The problem is that normal hooks are also attached to these submodules. GradSampleModule then sees two grad_sample tensors, assumes it needs to accumulate them, and adds them together.
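
To illustrate the double-counting, here is a minimal, self-contained sketch. `accumulate_grad_sample` is a simplified stand-in for the accumulation step GradSampleModule performs, not Opacus's actual code:

```
import torch

def accumulate_grad_sample(param, new_grad_sample):
    # simplified stand-in for GradSampleModule's accumulation logic
    if getattr(param, "grad_sample", None) is None:
        param.grad_sample = new_grad_sample
    else:
        param.grad_sample = param.grad_sample + new_grad_sample

p = torch.nn.Parameter(torch.zeros(2, 3))
per_sample_grads = torch.ones(4, 2, 3)       # per-sample gradients for a batch of 4
accumulate_grad_sample(p, per_sample_grads)  # written once by the functorch hook
accumulate_grad_sample(p, per_sample_grads)  # written again by the standard hook
print(p.grad_sample[0])                      # 2.0 everywhere instead of the correct 1.0
```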

## Solution(s)

There are essentially two ways we can fix this: either make functorch compute per-sample gradients for non-recursive parameters only, or don't attach normal hooks to submodules whose parent module is handled by functorch.

This diff implements the latter option (reasoning below); for demo purposes, the former option can be seen in #531.

From a pure code perspective, the former option (let's call it "non-recursive functorch") is more appealing to me. It better fits the existing paradigm and matches the behaviour of normal hooks: all of the existing code deals only with the immediate non-recursive parameters.
However, it doesn't make much sense from an efficiency perspective. "Non-recursive functorch" would do all the work to compute per-sample gradients for its submodules, only for them to be filtered out at the very last stage.
The alternative option (a.k.a. "functorch for subtrees") does involve slightly more convoluted bookkeeping, but it avoids that redundant computation.
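
To make the distinction concrete, here is a rough, hypothetical sketch of the hook-attachment rule behind "functorch for subtrees". It is illustrative only; `has_trainable_non_recursive_params` and `modules_to_hook` are made-up names, not Opacus's API:

```
from torch import nn

def has_trainable_non_recursive_params(module: nn.Module) -> bool:
    # parameters registered directly on this module, ignoring submodules
    return any(p.requires_grad for p in module.parameters(recurse=False))

def modules_to_hook(root: nn.Module):
    """Yield modules that should get standard hooks; subtrees whose root mixes
    direct parameters and submodules are left entirely to functorch."""
    stack = [root]
    while stack:
        module = stack.pop()
        children = list(module.children())
        if has_trainable_non_recursive_params(module) and children:
            # functorch computes per-sample gradients for this whole subtree,
            # so neither this module nor its descendants get a standard hook
            continue
        yield module
        stack.extend(children)
```

Under this sketch, `LinearWithExtraParam` from the example above is skipped together with its `fc` child, so only the functorch-computed grad samples exist for that subtree and nothing is double-counted.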

This choice has a noticeable effect on performance.
Below are the results of MNIST benchmarks with different model configurations. I tested multiple configurations because, at the end of the day, the impact on performance depends on how deep the functorch-handled subtrees are:

* Standard model - our model from the MNIST example, standard layers only (2 conv + 2 linear). No overhead expected; functorch doesn't kick in
* Mid-level model - the leaf nodes (two linear layers) have one extra param each and are computed with functorch. Overhead: 2x Linear hook
* Extreme model - the root module has one extra param and needs to be handled by functorch (see the sketch after this list). Overhead: 2x Linear hook + 2x Conv hook
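
For reference, a hypothetical sketch of what the "extreme" configuration looks like; this is an assumed structure with made-up layer sizes, not the exact benchmark code:

```
import torch
from torch import nn

class ExtremeModel(nn.Module):
    """Root module owns a direct parameter, so under "functorch for subtrees"
    the whole model becomes a single functorch-handled subtree."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc1 = nn.Linear(32 * 28 * 28, 32)
        self.fc2 = nn.Linear(32, 10)
        # extra root-level parameter that forces functorch handling
        self.extra_param = nn.Parameter(torch.randn(10, 10))

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.fc1(x.flatten(1)))
        x = self.fc2(x)
        return x.matmul(self.extra_param)
```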

| Mode                   | non-recursive functorch | functorch for subtrees |
|:----------------------:|:-----------------------:|:----------------------:|
| Standard model (CPU)   | 138s                    | 136s                   |
| Standard model (GPU)   | 149s                    | 150s                   |
| Mid-level model (CPU)  | 157s                    | 150s                   |
| Mid-level model (GPU)  | 100s                    | 97s                    |
| Extreme model (CPU)    | 207s                    | 172s                   |
| Extreme model (GPU)    | 101s                    | 94s                    |

Pull Request resolved: #510

Reviewed By: alexandresablayrolles

Differential Revision: D39579487

Pulled By: ffuuugor

fbshipit-source-id: 1b089bd04ab110174a1f2ebb371380eb2ce76054
@ffuuugor ffuuugor closed this Oct 28, 2022
psolikov pushed a commit to psolikov/opacus that referenced this pull request Nov 1, 2022
@karthikprasad karthikprasad deleted the ffuuugor_476a branch November 4, 2022 17:14