Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Acyclic partition patch #86511

Closed
wants to merge 36 commits into from

Conversation

jjsjann123
Copy link
Collaborator

@jjsjann123 jjsjann123 commented Oct 7, 2022

Fixes #86159 and #86108

Refactored graph partition to check for cyclic dependency on each partition merge, instead of relying on a pre-baked dependency map.

The previous implementation suffers from not updating dependency on existing partition. When a fusion happens, the updated dependency map needs to be propagated to all nodes in the graph, so each node in a partition shares an identical dependency set. Previous implementation suffers from the not identifying cyclic dependency in issue #86159.

Updated implementation does a cyclic check on partitioned graph before attempting a merge of two partitions.

  • python repro added with cyclic dependency after partition TestFXGraphPasses.forward12
  • fix dependency map with updated implementation using cyclic check

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 7, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86511

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a80075b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Oct 7, 2022
@jjsjann123
Copy link
Collaborator Author

Now I regret not having this stacked with #86452 😿

@jjsjann123 jjsjann123 requested review from davidberard98 and IvanYashchuk and removed request for davidberard98 and IvanYashchuk October 10, 2022 16:48
@dagitses dagitses added oncall: fx triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Oct 10, 2022
# check if merge would create cyclic dependency.
for node in merged_nodes:
for user_node in node.users:
if user_node not in merged_nodes and dfs_find_cycle(user_node):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah... this is a lot of DFS... and duplicated computation...
is performance a concern here?

I am wondering if it's better to use a dynamic maintained dependency_map to keep track of this... the look up complexity for dependency_map would be O(1)...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a lot of DFS... and duplicated computation...
I am wondering if it's better to use a dynamic maintained dependency_map to keep track of this

I don't think there is a lot of duplicated computation . We are traversing each node strictly once for each attempt to merge two nodes, since we record all visited.
If we are to cache a dependency_map, when a merge succeeds, we need to traverse and update the dependency_map for every node to reflect the updated dependency map.

So between those two approach, I don't think one is obviously faster then the other. Not caching the dependency map definitely looks much simpler on the implementation side and we don't have to constantly deal with modifying a nested table. If I'm to take a guess I tend to think that this would run faster in more cases 😆

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually ran into the error like max depth of recursion is reached 🥲

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha... That's.... a pleasant surprise to see that we actually run into big graphs for fusion

I should have some bandwidth tomorrow to refactor this. Should be pretty straightforward. I'll cc you in the PR once I get it to work.
Also if you don't hear from me in a few days, feel free to ping and yell at me to get me back on it 😆

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we go: #91042

Copy link
Contributor

@SherlockNoMad SherlockNoMad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Principally, I am fine with this change. The only concern is the performance issue of doing a lot of DFS. :) But I guess compilation time is not too big of a concern at this moment.

cc @wschin as you are also using this Partitioner.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 10, 2022
@facebook-github-bot
Copy link
Contributor

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jjsjann123
Copy link
Collaborator Author

@davidberard98 I saw you imported this PR. I'm wondering if it's OK for me to merge via the bot.

Since this is not an nvfuser code-bump PR. I don't see the need to go through internal workflow for extra safety 🦺

@davidberard98
Copy link
Contributor

@jjsjann123 yes, this PR is fine to merge via the bot. I just wanted to trigger some internal tests since I know this is used internally. Better to catch any issues now than have to wait for a revert & reland.

@jjsjann123
Copy link
Collaborator Author

@jjsjann123 yes, this PR is fine to merge via the bot. I just wanted to trigger some internal tests since I know this is used internally. Better to catch any issues now than have to wait for a revert & reland.

Thanks a lot for keeping an eye out for us here.

@jjsjann123
Copy link
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@jjsjann123 jjsjann123 deleted the acyclic_partition_patch branch October 10, 2022 23:52
pytorchmergebot pushed a commit that referenced this pull request Dec 20, 2022
…91042)

Follow up on PR #86511

Python's 1000 limit on recursion depth is not practical for us to run cyclic check on larger graphs. This refactor avoids that issue.
Pull Request resolved: #91042
Approved by: https://github.com/kit1980
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged module: fx.passes Optimization passes written in FX (don't forget to select a more specific label) oncall: fx open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[PrimTorch] Invalid partition found when running TIMM hrnet_w18
9 participants