Conversation


@jon-chuang jon-chuang commented Sep 30, 2023

~~`TensorMetadata` assumes a shape is a `torch.Size`/tuple, but some scheduler node groups use a bare `int`, so convert it to a tuple.~~

The root cause is actually the `foreach` scheduler node having a silently erroneous group of `int`, when it ought to be an opaque `foreach` marker.

**Previously:** silent error / confusing shape of `(0,)`
![image](https://github.com/pytorch/pytorch/assets/9093549/5bc2a3c7-151f-4433-bbf8-044c7b03e989)

**Now:** it is clear that this is a `foreach` kernel, which does not have a well-defined shape:
![image](https://github.com/pytorch/pytorch/assets/9093549/8373080d-4519-4e74-8a3b-da463e9968da)

~~An alternative would be to build a list of shapes for each of its subnodes. Actually, for debuggability's sake, I might prefer that. We could ensure that the recursive generation of this string happens only in a debug code path; alternatively, computing it incrementally when the `ForeachKernel` is initialized may also be feasible.~~ This is quite infeasible for 100s of params.
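The idea can be sketched roughly as follows. This is a minimal illustration with a hypothetical helper name, not the actual Inductor code: a `foreach` group that arrives as a bare `int` gets an opaque marker instead of being misread as a tuple shape.

```python
from typing import Any, Tuple, Union

# Hypothetical helper illustrating the fix; the real change lives in
# Inductor's scheduler/debug-formatting code.
def normalize_group_shape(
    group: Union[int, Tuple[Any, ...]]
) -> Union[Tuple[Any, ...], str]:
    if isinstance(group, int):
        # A bare int would previously surface as a confusing shape like (0,);
        # a foreach kernel has no single well-defined shape, so mark it opaque.
        return "foreach"
    # Ordinary groups are normalized to a plain tuple, as TensorMetadata expects.
    return tuple(group)
```

With this, graph-debug output shows `foreach` for such nodes rather than a bogus tuple shape.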

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

@pytorch-bot pytorch-bot bot added the release notes: fx release notes category label Sep 30, 2023

pytorch-bot bot commented Sep 30, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110336

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 874d400 with merge base 86196bf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jon-chuang jon-chuang changed the title fix(fx-graph-debug): shape is not always well-typed fix(inductor): shape can be int for graph debug Oct 1, 2023
@jon-chuang jon-chuang changed the title fix(inductor): shape can be int for graph debug fix(inductor): ForeachKernelSchedulerNode group shape should be opaque for graph debug Oct 1, 2023
@jon-chuang jon-chuang force-pushed the jon-chuang/fix-graph-debug branch from 212719c to 979da14 Compare October 1, 2023 07:42
@jon-chuang jon-chuang force-pushed the jon-chuang/fix-graph-debug branch from 979da14 to f81e60b Compare October 1, 2023 07:48
@colesbury colesbury requested a review from eellison October 3, 2023 23:06
@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 3, 2023
@eellison eellison requested review from mlazos and removed request for eellison October 10, 2023 20:01
Contributor

@mlazos mlazos left a comment


Could you possibly list all of the shapes? That would be really cool. If not, this is fine too.

@jon-chuang
Collaborator Author

> Could you possibly list all of the shapes? That would be really cool.

I think let's leave it opaque for now. It would be quite difficult to read when listing all the shapes.
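If listing the shapes were revisited later, readability could be preserved by truncating the summary. A hedged sketch (hypothetical helper, not part of this PR):

```python
from typing import Sequence, Tuple

def summarize_subnode_shapes(
    shapes: Sequence[Tuple[int, ...]], max_items: int = 3
) -> str:
    # Printing hundreds of param shapes is unreadable; show the first few
    # subnode shapes and collapse the rest into a count.
    shown = ", ".join(str(s) for s in shapes[:max_items])
    extra = len(shapes) - max_items
    suffix = f", ... +{extra} more]" if extra > 0 else "]"
    return f"foreach[{shown}{suffix}"
```

This keeps the debug string bounded regardless of how many params the foreach kernel covers.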

@jon-chuang
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 31, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-rocm5.6-py3.8 / test (default, 1, 3, linux.rocm.gpu)

Details for Dev Infra team: raised by workflow job

@jon-chuang
Copy link
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
…que for graph debug (pytorch#110336)

~~Shape is assumed by `TensorMetadata` to be torch.Shape/tuple, however, some of the scheduler node groups utilize `int`, so convert to tuple.~~

Root cause is actually `foreach` scheduler node having silent-error group of int, when in fact it ought to be opaque `foreach`.

**Previously:** silent error / confusing shape of (0,)
![image](https://github.com/pytorch/pytorch/assets/9093549/5bc2a3c7-151f-4433-bbf8-044c7b03e989)

**Now:** clear that it is foreach which does not have well-defined shape:
![image](https://github.com/pytorch/pytorch/assets/9093549/8373080d-4519-4e74-8a3b-da463e9968da)

~~Alternate might be to create list of shapes for each of its subnodes. Actually, for debuggability sake, I may prefer this. We can ensure that the recursive generation of this string is only done dynamically in a debug code path. Else, incrementally computing it on initialization of ForeachKernel may also be feasible.~~ This is quite infeasible for 100s of params.

Pull Request resolved: pytorch#110336
Approved by: https://github.com/mlazos
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
…que for graph debug (pytorch#110336)


Labels

- ciflow/inductor
- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- module: inductor
- open source
- release notes: fx (release notes category)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
