
Conversation

davidberard98 (Contributor) commented on May 17, 2022:

Stack from ghstack:

Original PR: #77295

Original commit message:
On GPU, conv raises an error if its inputs do not all have the same dtype.

In the case of autocasting during freezing, what we see is:

  1. inputs to conv are cast to half
  2. inputs to batchnorm are not cast, so many are still float
  3. we try to fold conv + batchnorm by finding a new weight and bias such that conv(input, new_weight, new_bias) is equivalent to the original conv -> batchnorm

If the conv previously had no bias (its optional bias was absent), then during freezing we temporarily create a zero-valued bias as a placeholder for conv_bias. We want to construct it with the same dtype as the weight input to conv, to avoid dtype-mismatch errors on GPU; a sketch of the folding arithmetic follows below.
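
For illustration, a minimal sketch of the conv-batchnorm folding arithmetic and the placeholder bias, using a hypothetical fold_conv_bn helper; this is not the PR's actual implementation:

```python
import torch

def fold_conv_bn(conv_w, conv_b, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    # If conv had no bias, create the zero placeholder with the SAME dtype
    # (and device) as the weight: under autocast the weight may already be
    # half while the batchnorm stats are still float.
    if conv_b is None:
        conv_b = torch.zeros(conv_w.size(0), dtype=conv_w.dtype,
                             device=conv_w.device)
    # Standard folding: bn(conv(x)) == conv(x, new_w, new_b), where
    # scale = gamma / sqrt(var + eps).
    scale = bn_gamma / torch.sqrt(bn_var + eps)
    new_w = conv_w * scale.reshape(-1, *([1] * (conv_w.dim() - 1)))
    new_b = (conv_b - bn_mean) * scale + bn_beta
    return new_w, new_b
```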

Reland changes:
There's a memory leak from the CUDA caching allocator as a side effect of this fix. The leak causes the test to fail, though for some reason it didn't fail in CI on the previous PR. This change skips the affected tests for now.
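
The skip itself is the standard unittest mechanism; the class and test names below are hypothetical (the real test lives in PyTorch's JIT freezing test suite):

```python
import unittest

class TestFrozenConvBatchnorm(unittest.TestCase):
    # Hypothetical name; the real test is in PyTorch's JIT test suite.
    @unittest.skip("CUDA caching allocator leak triggered by this fix; see reland notes")
    def test_conv_bn_folding_autocast_gpu(self):
        ...
```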

…s half"

Original PR: #77295

Original commit message:
On GPU, conv errors if not all its inputs have the same dtype.

In the case of autocasting during freezing, what we see is:
1) inputs to conv are casted to half
2) inputs to batchnorm are not casted, so many are still floats
3) we try to fold conv + batchnorm, by finding different weight and bias such that conv(input, new_weight, new_bias) is equivalent to the original conv -> batchnorm.

If conv previously had an optional bias, then during freezing we will temporarily create a zero-valued bias as a placeholder for conv_bias. We want to construct it to have the same dtype as the weight input to conv, to avoid errors on GPU.

Reland changes:
There's a memory leak from cuda caching allocator that is a side effect of this fix. The memory leak causes the test to fail, though for some reason it didn't fail on CI in the last PR. This skips the tests for now.

[ghstack-poisoned]
facebook-github-bot (Contributor) commented on May 17, 2022:

✅ No Failures (1 Pending)

As of commit dc27594 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


davidberard98 added a commit that referenced this pull request May 17, 2022
…s half"


ghstack-source-id: 5c401ff
Pull Request resolved: #77617
facebook-github-bot added the oncall: jit label (Add this issue/PR to JIT oncall triage queue) on May 17, 2022
jjsjann123 (Collaborator) commented:
> There's a memory leak from the CUDA caching allocator as a side effect of this fix. The leak causes the test to fail, though for some reason it didn't fail in CI on the previous PR. This change skips the affected tests for now.

I took a look at the failing test.

I noticed that the reported caching allocator leak is roughly the same magnitude as the parameters in the model. Looking at the frozen graph, we are converting properties to prim::Constant in the graph. Disabling that resolves the caching allocator failure for me locally; that is, commenting out this line:

propagateAttributes(graph);

Not sure how that would only break with the caching allocator 😕
cc'ing @ngimel
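
For anyone trying to reproduce, a rough sketch of how such a leak can be observed locally (illustrative model and harness, not the actual failing test):

```python
import torch

# Illustrative harness, not the CI test: compare CUDA caching allocator
# usage before and after freezing to spot tensors (e.g. attribute tensors
# baked into prim::Constant nodes) that outlive the frozen module.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3),
    torch.nn.BatchNorm2d(8),
).cuda().eval()

torch.cuda.empty_cache()
before = torch.cuda.memory_allocated()

frozen = torch.jit.freeze(torch.jit.script(model))
del frozen  # constants owned by the frozen module should be freed here

torch.cuda.empty_cache()
after = torch.cuda.memory_allocated()
print(f"approx. leaked bytes: {after - before}")
```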

davidberard98 (Contributor, Author) commented:
@pytorchbot merge this please

github-actions (Contributor) commented:
Hey @davidberard98.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request May 18, 2022
…s half" (#77617)

Summary: see the original commit message above (Original PR: #77295).

Pull Request resolved: #77617

Approved by: https://github.com/eellison

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/d0dc7cb7743825eeb4eff804b22ec37cc1d78cb7

Reviewed By: atalman

Differential Revision: D36445937

Pulled By: davidberard98

fbshipit-source-id: 9af6869b948697c0464da2d8ad6ba0a40a42fd55
facebook-github-bot deleted the gh/davidberard98/118/head branch on May 20, 2022 at 14:17

Labels

cla signed · Merged · oncall: jit (Add this issue/PR to JIT oncall triage queue)
