[Autograd] Use in-place input accumulation fast path for dense Tensors. #88339

Closed · wants to merge 7 commits

Conversation

@robieta commented Nov 2, 2022:

Stack from ghstack (oldest at bottom):

There is a fast path in InputBuffer to steal memory when the use count shows the buffer holds the only reference; however, it is only used for sparse Tensors. According to Natalia, this is just because it wasn't obvious that there would be a benefit for dense Tensors, so there was no reason to live dangerously. However, I've noticed large Tensors in internal models that would benefit from this optimization as well.

Differential Revision: [D40946601](https://our.internmc.facebook.com/intern/diff/D40946601/)

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7
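
For orientation, here is a minimal sketch of the accumulation step this PR optimizes, assuming the shape shown in the diffs later in this thread; it is not the exact merged source.

```cpp
#include <ATen/ATen.h>
#include <vector>

// Sketch only: steal old_var's buffer when nothing else can observe it.
static void accumulate(
    std::vector<at::Tensor>& buffer,
    size_t pos,
    at::Tensor&& var) {
  auto& old_var = buffer[pos];
  if (old_var.is_contiguous() && old_var.use_count() == 1 &&
      old_var.storage().use_count() == 1) {
    // We hold the only reference to old_var and its storage, so an in-place
    // add is unobservable and avoids allocating a third full-size Tensor.
    buffer[pos] = old_var.add_(var);
  } else {
    buffer[pos] = var + old_var;  // out-of-place fallback
  }
}
```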

@pytorch-bot commented Nov 2, 2022:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88339

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failure

As of commit 835e532:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

```
@@ -86,13 +86,12 @@ static void accumulate(
      } else {
        buffer[pos] = var + old_var;
      }
    } else if (
        old_var.is_contiguous() && old_var.use_count() == 1 &&
```
A Collaborator commented on this diff:

I don't think we want to take this path when both `var` and `old_var` are sparse; who even knows if we have working in-place sparse-sparse addition, and in any case it's a conceptually weird operation, so you might want to push this case to the else branch.

@robieta (Author) replied Nov 2, 2022:

If `old_var` is sparse it will get caught by the first part of the conditional (which has a `!var.is_sparse()` on its steal check). Maybe I should add a cautionary comment?
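
For context, a sketch of the branch ordering being described, reconstructed from this thread rather than copied from the source: a sparse `old_var` is handled by the first branch, so the dense steal never sees one.

```cpp
// Reconstruction (assumed, not verbatim): sparse old_var is caught first.
if (old_var.is_sparse()) {
  // Steal var's dense memory instead; note the !var.is_sparse() guard.
  if (!var.is_sparse() && var.is_contiguous() && var.use_count() == 1 &&
      var.storage().use_count() == 1) {
    buffer[pos] = var.add_(old_var);
  } else {
    buffer[pos] = var + old_var;
  }
} else if (old_var.is_contiguous() && old_var.use_count() == 1 &&
           old_var.storage().use_count() == 1) {
  buffer[pos] = old_var.add_(var);  // the new dense fast path
} else {
  buffer[pos] = var + old_var;
}
```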

Collaborator replied:

Ah yeah that would be good, thanks for checking!

@robieta added the `module: autograd` (Related to torch.autograd, and the autograd engine in general) and `release notes: autograd` (release notes category) labels on Nov 2, 2022.
@robieta commented Nov 2, 2022:

Looks like I also need a `!old_var._is_zerotensor()` check. @anjali411 is there a more general mutability check that I should use?

robieta pushed an update to the stack. The revised condition:
```
    } else if (
        old_var.is_contiguous() && !old_var._is_zerotensor() &&
        old_var.use_count() == 1 && old_var.storage().use_count() == 1) {
      buffer[pos] = old_var.add_(var);
```
A Collaborator commented on this diff:

I think there are a few things wrong here that need updating. In particular, looking at the logic for AccumulateGrad (see the "Gradient Layout Contract" comment there) shows a lot of possible failure cases here:

  • We need to check that grad mode is disabled. This in-place update might be invalid if `old_var` is saved by autograd.
  • You need to guard for `.is_sparse_csr()`, which is not covered by `.is_sparse()`.
  • We want `is_non_overlapping_and_dense` on top of `is_contiguous`, as we really don't want weird memory-overlapping Tensors to be written in place.
  • Why the special case on `zero_tensor` here? (A sketch collecting these checks follows this list.)
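
A sketch of a guard collecting these review points; the helper name `can_accumulate_inplace` is illustrative, not necessarily what the merged code uses.

```cpp
#include <ATen/ATen.h>

// Illustrative helper (assumed name): true only when v can safely be
// mutated in place by the accumulation fast path.
static bool can_accumulate_inplace(const at::Tensor& v) {
  return !v.is_sparse() && !v.is_sparse_csr() &&  // CSR is not covered by is_sparse()
      !v._is_zerotensor() &&                      // ZeroTensors are immutable
      v.is_non_overlapping_and_dense() &&         // stronger than is_contiguous()
      v.use_count() == 1 && v.storage().use_count() == 1;
  // Grad mode must additionally be checked by the caller; see below.
}
```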

@robieta (Author) replied:

> We need to check that grad mode is disabled. This inplace might be invalid if old_var is saved by autograd

Should that be `GradMode::is_enabled()` or a property of `old_var`?

SGTM on `is_sparse_csr` and `is_non_overlapping_and_dense`.

> Why the special case on zero_tensor here?

ZeroTensors are immutable.

Collaborator replied:

> is_non_overlapping_and_dense on top of is_contiguous

Instead of `is_contiguous`? Yeah, that's reasonable. IIRC it errors out on sparse tensors though, so the logic will need to become more tortured.

@robieta (Author) replied:

FWIW I'm also updating the path where `var` is used, so there will be sparse checks before any other checks.

Collaborator replied:

> Should that be GradMode::is_enabled() or a property of old_var?

`GradMode::is_enabled()`, because it might not be OK to modify `old_var` even if it doesn't require gradients.
Note that it could still be invalid when grad mode is disabled, but that should be unlikely enough here that we're OK.
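
Putting the two points together, a sketch of the call-site guard, reusing the illustrative `can_accumulate_inplace` helper from the earlier sketch:

```cpp
#include <ATen/core/grad_mode.h>

// Sketch: only steal when grad mode is off, since old_var may have been
// saved for backward by a preceding operation.
if (!at::GradMode::is_enabled() && can_accumulate_inplace(old_var)) {
  buffer[pos] = old_var.add_(var);
} else {
  buffer[pos] = var + old_var;
}
```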

@robieta commented Nov 2, 2022:

Updated with more rigorous checks and comments.

@robieta commented Nov 2, 2022:

The latest failure is `torch.autograd.gradcheck.GradcheckError: While computing batched gradients, got: Cannot access storage of BatchedTensorImpl`. Nested Tensor, sparse Tensor, ZeroTensor, BatchedTensor, etc. will all blow up if various methods are called, and it's virtually impossible to get everything right, much less future-proof it. I'll fix the unit test failures, but I'm increasingly coming to the conclusion that we need to add something like `is_plain_old_tensor()` to Tensor.

@zou3519 (Contributor) commented Nov 2, 2022:

There's an `isTensorSubclassLike` check for this, though it is not completely comprehensive yet:

```cpp
inline bool isTensorSubclassLike(const Tensor& tensor) {
```

We probably want to exclude tensors for which `isTensorSubclassLike` returns true from this fast path.

@robieta commented Nov 3, 2022:

> there's an isTensorSubclassLike check for this, though it is not completely comprehensive yet ... We probably want to exclude tensors for which isTensorSubclassLike returns true from this fast path

This is AWESOME! (CC @chaekit, @aaronenyeshi, @slgong-fb, we should use this in profiler too.) AFAICT it doesn't check nested or any other C++ subclass? (And if not, do you think it should?) Plus it includes sparse and sparse_csr, so we shouldn't need those explicit checks.
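
A hedged usage sketch of the check being discussed; `at::isTensorSubclassLike` lives in `ATen/TensorSubclassLikeUtils.h`, and the exclusion below reflects this thread's conclusion rather than the exact merged condition.

```cpp
#include <ATen/TensorSubclassLikeUtils.h>

// Exclude subclass-like tensors (Batched, and per this thread also
// sparse/sparse_csr) from the steal path in one check, rather than
// enumerating each special TensorImpl.
const bool plain_tensors =
    !at::isTensorSubclassLike(var) && !at::isTensorSubclassLike(old_var);
```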

robieta pushed a commit that referenced this pull request Nov 3, 2022 (Pull Request resolved: #88339; ghstack-source-id: 172517219; Differential Revision: [D40946601](https://our.internmc.facebook.com/intern/diff/D40946601/)).
@robieta requested a review from ngimel · November 3, 2022 01:42
robieta pushed a commit that referenced this pull request Nov 3, 2022 (ghstack-source-id: 172596747).
@robieta commented Nov 3, 2022:

I also had to gate on `var` not being a subclass because of a forward-mode AD test. In theory I could have made the test check grad mode, but given how varied Tensor subclasses can be, it seems like a better idea to just exclude them from the fast path. PTAL

robieta pushed a commit that referenced this pull request Nov 7, 2022 (ghstack-source-id: 172919004).
robieta pushed a commit that referenced this pull request Nov 8, 2022 (ghstack-source-id: 172984864).
@robieta commented Nov 8, 2022:

@pytorchbot merge -f "test failure is unrelated. (failed to install triton)"

@pytorchmergebot commented:

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

@mehtanirav (Contributor) commented:

@pytorchbot revert -m "Internal test failures" -c ghfirst

@pytorchmergebot commented:

@pytorchbot successfully started a revert job.

@pytorchmergebot commented:
@robieta your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Nov 11, 2022: Revert "[Autograd] Use in-place input accumulation fast path for dense Tensors. (#88339)"

This reverts commit 8f66ae4.

Reverted #88339 on behalf of https://github.com/mehtanirav due to Internal test failures
robieta pushed commits that referenced this pull request Dec 5, 2022, relanding the change as #90217: identical to #88339 except with a `.has_storage()` check before `.storage()`. Differential Revision: [D41737935](https://our.internmc.facebook.com/intern/diff/D41737935/) (ghstack-source-id: 175170552). Pull Request resolved: #90217.
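
A sketch of the extra guard the reland adds, assuming the motivation is the `BatchedTensorImpl` failure reported above (tensors without storage throw when `.storage()` is called):

```cpp
// has_storage() must come first: calling .storage() on, e.g., a
// BatchedTensorImpl throws rather than returning an empty storage.
const bool storage_unique =
    old_var.has_storage() && old_var.storage().use_count() == 1;
```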
kulinseth pushed commits to kulinseth/pytorch that referenced this pull request Dec 10, 2022: a cherry-pick of the original change (pytorch#88339, approved by https://github.com/ngimel) followed by its revert (reverts commit 8f66ae4, on behalf of https://github.com/mehtanirav due to internal test failures).
robieta pushed further updates to the reland (#90217, "Identical to #88339 except with a `.has_storage()` check before `.storage()`", Differential Revision [D41737935](https://our.internmc.facebook.com/intern/diff/D41737935/)) on Jan 3, Jan 31, Feb 9, and Feb 10, 2023.
pytorchmergebot pushed a commit that referenced this pull request Feb 10, 2023: [Autograd] Use in-place input accumulation fast path for dense Tensors. (#90217)

Identical to #88339 except with a `.has_storage()` check before `.storage()`.

Differential Revision: [D41737935](https://our.internmc.facebook.com/intern/diff/D41737935/)

Pull Request resolved: #90217
Approved by: https://github.com/ngimel
@facebook-github-bot deleted the gh/robieta/148/head branch June 8, 2023 18:34
Labels: ciflow/periodic · ciflow/trunk · Merged · module: autograd (Related to torch.autograd, and the autograd engine in general) · release notes: autograd (release notes category) · Reverted
6 participants