Conversation

@voznesenskym voznesenskym commented Oct 1, 2023

keep_graph in compiled_autograd

[ghstack-poisoned]

pytorch-bot bot commented Oct 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110367

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit a606c35 with merge base efdf155:

UNSTABLE - The following jobs failed, but the failures were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

"test_hook_closure_cycle_use_custom_function_False_use_tensor_hook_True", # compiled_autograd does not support keep_graph
"test_hook_closure_cycle_use_custom_function_True_use_tensor_hook_False", # compiled_autograd does not support keep_graph
"test_hook_closure_cycle_use_custom_function_True_use_tensor_hook_True", # compiled_autograd does not support keep_graph
"test_accumulate_grad", # RuntimeError: compiled_autograd does not support create_graph
10 tests passed now, but regenerated all to get new errors.

@voznesenskym voznesenskym requested a review from jansel October 1, 2023 22:28
@jansel jansel left a comment


Looks like tests are failing

Comment on lines 33 to 39
def keep_graph(graph, keep_graph):
graph.keep_graph = keep_graph
return graph


def should_keep_graph(graph):
return graph.keep_graph

Why does this need to be in Python?


@voznesenskym voznesenskym Oct 2, 2023


Setting breakpoints / debugging is easier ;)

I can move the setting and reading entirely to C++.
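The Python-level helpers under discussion can be exercised standalone. In this sketch, `SimpleNamespace` stands in for the captured graph object produced by the capture step (hypothetical usage; the helpers themselves are copied from the diff above):

```python
from types import SimpleNamespace

# Helpers as they appear in the PR diff: annotate the captured graph
# with the engine's keep_graph flag, and read it back later.
def keep_graph(graph, keep_graph):
    graph.keep_graph = keep_graph
    return graph

def should_keep_graph(graph):
    return graph.keep_graph

# SimpleNamespace stands in for the object returned by the capture.
g = keep_graph(SimpleNamespace(), True)
print(should_keep_graph(g))  # True
```

Being plain Python, these are trivially breakpoint-able, which is the convenience the author mentions.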

keep_graph in compiled_autograd

cc penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler

@voznesenskym

> Looks like tests are failing

Most of the actual tests seem to pass; I think it's static analysis (correctly) catching a falsy condition that leads to an uninitialized pointer access.

Fixed.

- cache->compiled_fn = check(call_end_capture(py_compiler, state.outputs));
+ PyObject* graph = check(call_end_capture(py_compiler, state.outputs));
+ cache->compiled_fn =
+     check(call_keep_graph(check(graph), graph_task.keep_graph_));
this can be removed


ezyang commented Oct 4, 2023

I agree with the conclusion that retain_graph does not affect the graph structure (though, it WOULD if we let the compiled autograd graph steal all of the saved variables). Given this, most of the changes in this PR can be reverted and you end up with literally only the lines that toggle releasing variable at the end.
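ezyang's observation can be sketched in plain Python (illustrative only; class and method names are hypothetical, not PyTorch internals): the compiled function is cached once regardless of the flag, and only the variable-release step branches on it.

```python
# Sketch: retain_graph need not be part of the compile cache key,
# because the traced graph is identical either way; only the
# post-backward "release saved variables" step is conditional.
class CompiledAutogradCache:
    def __init__(self):
        self.compiled_fn = None
        self.compiles = 0

    def run(self, saved_variables, keep_graph):
        if self.compiled_fn is None:
            self.compiles += 1  # compile once, reuse for both flag values
            self.compiled_fn = lambda saved: sum(saved)
        out = self.compiled_fn(saved_variables)
        if not keep_graph:
            saved_variables.clear()  # the only behavior that differs
        return out

cache = CompiledAutogradCache()
saved = [1, 2, 3]
cache.run(saved, keep_graph=True)   # saved survives this call
cache.run(saved, keep_graph=False)  # saved is released afterwards
print(cache.compiles, saved)        # 1 []
```

This matches the suggestion that most of the PR can be reverted down to the lines that toggle releasing variables at the end.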


albanD commented Oct 4, 2023

> Usually you only need retain_graph for double-backward-like situations; are we running into this?

It is enabled when create_graph=True, so in this situation, yes.
It is also used when you need to backward through a given model multiple times without changing it (as in GAN use cases).
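The create_graph implication comes from how torch.autograd.backward resolves the flag: when retain_graph is not passed, it defaults to the value of create_graph. A minimal mirror of that rule:

```python
# Mirror of the flag-resolution rule in torch.autograd.backward:
# an unset retain_graph inherits create_graph, which is why
# create_graph=True (double backward) implies keeping the graph.
def resolve_retain_graph(retain_graph=None, create_graph=False):
    if retain_graph is None:
        retain_graph = create_graph
    return retain_graph

print(resolve_retain_graph())                                       # False
print(resolve_retain_graph(create_graph=True))                      # True
print(resolve_retain_graph(retain_graph=False, create_graph=True))  # False
```

Note the last case: an explicit retain_graph=False is respected even when create_graph=True.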

@voznesenskym voznesenskym changed the title keep_graph in compiled_autograd retain_graph=True in compiled_autograd Oct 5, 2023
Adds support for retain_graph=True, known as keep_graph_ internally in the autograd engine.

voznesenskym added a commit that referenced this pull request Oct 5, 2023
keep_graph in compiled_autograd

ghstack-source-id: 6188646
Pull Request resolved: #110367

Fixes, feedback
@voznesenskym voznesenskym requested a review from jansel October 5, 2023 21:02
@voznesenskym

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 5, 2023
@pytorchmergebot

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job.

@voznesenskym

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status.

@pytorchmergebot

Merge failed

Reason: 14 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@voznesenskym

@pytorchbot merge -f "Flaky CI"

@voznesenskym
[screenshot]

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort, and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status.

pytorchmergebot pushed a commit that referenced this pull request Oct 6, 2023
This change is needed after pytorch/test-infra#4579 and pytorch/test-infra#4610.  All invalid cancelled signals have been removed from Dr.CI and HUD.  So trymerge should ignore them accordingly for a consistent experience.

### Testing

#110367 (comment) is the PR where a bunch of invalid cancelled signals showed up and blocked merges

Pull Request resolved: #110690
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
@facebook-github-bot facebook-github-bot deleted the gh/voznesenskym/232/head branch October 9, 2023 14:24