Acyclic partition patch #86511
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86511
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit a80075b: This comment was automatically generated by Dr. CI and updates every 15 minutes.
Now I regret not having this stacked with #86452 😿
# check if merge would create cyclic dependency.
for node in merged_nodes:
    for user_node in node.users:
        if user_node not in merged_nodes and dfs_find_cycle(user_node):
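The `dfs_find_cycle` helper isn't part of this hunk; a minimal sketch of what such a recursive check might look like, assuming nodes expose a `users` list (the signature and `Node` shape here are illustrative, not the actual partitioner API):

```python
# Hypothetical sketch: if any path of user edges from `start` leads back into
# `merged_nodes`, merging those nodes into one partition would create a cycle.
def dfs_find_cycle(start, merged_nodes, visited=None):
    if visited is None:
        visited = set()
    if start in merged_nodes:
        return True  # found a path back into the merged partition
    if start in visited:
        return False  # this subgraph was already explored
    visited.add(start)
    return any(dfs_find_cycle(u, merged_nodes, visited) for u in start.users)
```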
ah... this is a lot of DFS... and duplicated computation...
is performance a concern here?
I am wondering if it's better to use a dynamically maintained dependency_map to keep track of this... the lookup complexity for dependency_map would be O(1)...
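For reference, such a dependency map could be built in one reverse-topological pass; this is only a sketch, assuming `nodes` is topologically sorted and each node exposes a `users` list (the names are illustrative, not the real Partitioner internals):

```python
from collections import defaultdict

def build_dependency_map(nodes):
    # dependency_map[n] = all nodes transitively reachable from n via user
    # edges; with this cached, "would a merge create a cycle?" becomes a
    # set lookup instead of a fresh DFS.
    dependency_map = defaultdict(set)
    for node in reversed(nodes):  # users appear later in topological order
        for user in node.users:
            dependency_map[node].add(user)
            dependency_map[node] |= dependency_map[user]
    return dependency_map
```

The cost of this approach is that every successful merge has to propagate updates back through the cached map.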
this is a lot of DFS... and duplicated computation...
I am wondering if it's better to use a dynamically maintained dependency_map to keep track of this
I don't think there is a lot of duplicated computation. We are traversing each node strictly once for each attempt to merge two nodes, since we record all visited nodes.

If we were to cache a dependency_map, then every time a merge succeeds we would need to traverse the graph and update the dependency_map for every node, so that each node reflects the updated dependencies.

So between those two approaches, I don't think one is obviously faster than the other. Not caching the dependency map definitely looks much simpler on the implementation side, and we don't have to constantly deal with modifying a nested table. If I had to guess, I'd say the non-cached version runs faster in more cases 😆
I actually ran into an error where the max recursion depth was reached 🥲
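For what it's worth, that recursion-depth failure can be sidestepped by driving the same traversal with an explicit stack; a rough sketch, with the same assumed node interface as above:

```python
def find_cycle_iterative(start, merged_nodes):
    # Same check as the recursive DFS, but with an explicit stack, so the
    # traversal depth is not bounded by Python's default ~1000-frame
    # recursion limit.
    stack, visited = [start], set()
    while stack:
        node = stack.pop()
        if node in merged_nodes:
            return True  # a user path leads back into the merged partition
        if node in visited:
            continue
        visited.add(node)
        stack.extend(node.users)
    return False
```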
Ha... That's... a pleasant surprise to see that we actually run into graphs that big for fusion
I should have some bandwidth tomorrow to refactor this. Should be pretty straightforward. I'll cc you in the PR once I get it to work.
Also if you don't hear from me in a few days, feel free to ping and yell at me to get me back on it 😆
Here we go: #91042
Principally, I am fine with this change. The only concern is the performance issue of doing a lot of DFS. :) But I guess compilation time is not too big of a concern at this moment.
cc @wschin as you are also using this Partitioner.
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@davidberard98 I saw you imported this PR. I'm wondering if it's OK for me to merge via the bot. Since this is not an nvfuser code-bump PR, I don't see the need to go through the internal workflow for extra safety 🦺
@jjsjann123 yes, this PR is fine to merge via the bot. I just wanted to trigger some internal tests since I know this is used internally. Better to catch any issues now than have to wait for a revert & reland.
Thanks a lot for keeping an eye out for us here.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…91042) Follow-up on PR #86511. Python's 1000-frame recursion depth limit is not practical for running the cyclic check on larger graphs; this refactor avoids that issue. Pull Request resolved: #91042. Approved by: https://github.com/kit1980
Fixes #86159 and #86108
Refactored graph partitioning to check for cyclic dependencies on each partition merge, instead of relying on a pre-baked dependency map.

The previous implementation did not update dependencies on existing partitions: when a fusion happens, the updated dependency map needs to be propagated to every node in the graph so that each node in a partition shares an identical dependency set. Because it failed to do this, the previous implementation did not identify the cyclic dependency in issue #86159.

The updated implementation runs a cyclic check on the partitioned graph before attempting to merge two partitions.
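As a rough illustration of the merge flow described in this PR (the `Partition` shape and function names here are assumptions for the sketch, not the real torch.fx partitioner types):

```python
class Partition:
    def __init__(self, nodes):
        self.nodes = set(nodes)

def maybe_merge(p1, p2, would_create_cycle):
    # Tentatively union the two partitions' node sets and run the cyclic
    # check on the combined set; only commit the merge if no cycle is found.
    merged = p1.nodes | p2.nodes
    if would_create_cycle(merged):
        return None  # reject: merging would make the partition graph cyclic
    return Partition(merged)
```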
TestFXGraphPasses.forward12