Skip to content

Conversation

mengluy0125
Copy link
Contributor

@mengluy0125 mengluy0125 commented Feb 26, 2024

Summary:
We observed that stack nodes have missing exampe_value in DPA+FIRST, causing issue to further do split cat. Full error log: P1187633689.

pre grad graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPUFOBWniTeB6s8DAN8z9sHTadpxbr0LAAAz

We found that it was introduced by the new stack nodes in the group batch fusion, thus we fix the bug to enable further split cat optimization.

Test Plan:

buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch

before fix: P1187633689

W0221 13:32:09.334000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_19
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_6
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_5
W0221 13:32:09.336000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_4
W0221 13:32:09.517000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_18
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_17
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_15
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_14
W0221 13:32:09.522000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_16
W0221 13:32:09.524000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_12
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_11
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_13
W0221 13:32:09.527000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_9
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_8
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_10
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_7

after fix:
P1189491364

W0226 13:19:56.542000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0226 13:19:56.543000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0226 13:19:56.703000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0226 13:19:56.707000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0226 13:19:56.711000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0226 13:19:56.713000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17

Differential Revision: D54140488

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames

Copy link

pytorch-bot bot commented Feb 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120655

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9d91ae6 with merge base f36e00b (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D54140488

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D54140488

@mengluy0125 mengluy0125 changed the title [PT2][Inductor] Example_value absent debug [PT2][Inductor] Fix "example_value" absent for stack nodes Feb 26, 2024
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 26, 2024
@mengluy0125 mengluy0125 force-pushed the export-D54140488 branch 3 times, most recently from 81296e0 to b379313 Compare February 27, 2024 18:30
Summary:
We observed that stack nodes have missing exampe_value in DPA+FIRST, causing issue to further do split cat. Full error log: P1187633689.

pre grad graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPUFOBWniTeB6s8DAN8z9sHTadpxbr0LAAAz

We found that it was introduced by the new stack nodes in the group batch fusion, thus we fix the bug to enable further split cat optimization.

Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
before fix: P1187633689
```
W0221 13:32:09.334000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_19
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_6
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_5
W0221 13:32:09.336000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_4
W0221 13:32:09.517000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_18
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_17
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_15
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_14
W0221 13:32:09.522000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_16
W0221 13:32:09.524000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_12
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_11
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_13
W0221 13:32:09.527000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_9
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_8
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_10
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_7
```

after fix:
P1189491364
```
W0226 13:19:56.542000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0226 13:19:56.543000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0226 13:19:56.703000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0226 13:19:56.707000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0226 13:19:56.711000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0226 13:19:56.713000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
```

Reviewed By: jackiexu1992

Differential Revision: D54140488
@facebook-github-bot
Copy link
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants