Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CUDA Graphs] Add option to dump a captured graph for debugging #85519

Closed
wants to merge 11 commits into from

Conversation

eqy
Copy link
Collaborator

@eqy eqy commented Sep 22, 2022

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 22, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85519

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c08266d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eqy eqy added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 28, 2022
@eqy eqy marked this pull request as ready for review September 28, 2022 22:06
AT_CUDA_CHECK(cudaGraphDestroy(graph_));
has_graph_ = false;
} else {
TORCH_WARN("DEBUG: TORCH_CUDAGRAPHS_DEBUG_PATH detected. graph_ will not be freed");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is outdated.

def debug_dump(self):
r"""
Calls a debugging function to dump the graph if the dump path has been
set via torch._C._cuda_setCudaGraphsDebugPath()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be torch._C._cuda_set_graphs_debug_path?

@eqy
Copy link
Collaborator Author

eqy commented Sep 29, 2022

@pytorchmergebot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased cuda_graphs_debugging onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_graphs_debugging && git pull --rebase)

TORCH_WARN("DEBUG: calling debug_dump()");
if (has_graph_) {
TORCH_WARN("DEBUG: calling cudaGraphDebugDotPrint() with ", _cuda_graphs_debug_path);
C10_CUDA_CHECK_WARN(cudaGraphDebugDotPrint(graph_, _cuda_graphs_debug_path.c_str(), 1<<10)); // most verbose output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can destroy graph_ now?

@@ -979,6 +984,10 @@ static struct PyMethodDef _THCPModule_methods[] = {
THCPModule_setBenchmarkLimitCuDNN,
METH_O,
nullptr},
{"_cuda_set_graphs_debug_path",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of _cuda_set_graphs_debug_path you should have something like enable_cudagraph_debug_mode that will preserve graph_ object for further inspection - you might want to do something other than dumping dot graph. As for path to dump the file, debug_dump itself should accept path argument - that would make it more consistent with other io functions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for path to dump the file, debug_dump itself should accept path argument - that would make it more consistent with other io functions.

👍 as this was also my first approach while testing the PR without looking at the code

@facebook-github-bot
Copy link
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@eqy eqy changed the title [WIP][CUDA Graphs] Add option to dump a captured graph for debugging [CUDA Graphs] Add option to dump a captured graph for debugging Oct 28, 2022
@soulitzer soulitzer added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 31, 2022
@eqy
Copy link
Collaborator Author

eqy commented Nov 3, 2022

@pytorchmergebot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased cuda_graphs_debugging onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_graphs_debugging && git pull --rebase)

@eqy
Copy link
Collaborator Author

eqy commented Nov 22, 2022

@pytorchmergebot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/85519/head returned non-zero exit code 1

Rebasing (1/12)
Auto-merging torch/csrc/cuda/Graph.cpp
CONFLICT (content): Merge conflict in torch/csrc/cuda/Graph.cpp
error: could not apply ca2cc45aad... check in
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply ca2cc45aad... check in

Raised by https://github.com/pytorch/pytorch/actions/runs/3526151366

@eqy
Copy link
Collaborator Author

eqy commented Nov 25, 2022

@pytorchmergebot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased cuda_graphs_debugging onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_graphs_debugging && git pull --rebase)

@eqy
Copy link
Collaborator Author

eqy commented Dec 6, 2022

@pytorchmergebot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased cuda_graphs_debugging onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_graphs_debugging && git pull --rebase)

@eqy
Copy link
Collaborator Author

eqy commented Dec 6, 2022

@pytorchmergebot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request cla signed Merged open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants