Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow "must recompute" in torch.compile + selective checkpointing (SAC) #114036

Closed
wants to merge 31 commits into from

Conversation

yf225
Copy link
Contributor

@yf225 yf225 commented Nov 19, 2023

As titled. Done by setting up special edge from X_in to sink, so that min-cut partitioner will always put X_in in the backward graph (thus always recomputing X).

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

Copy link

pytorch-bot bot commented Nov 19, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114036

Note: Links to docs will display an error until the docs builds have been completed.

❌ 48 Cancelled Jobs

As of commit 7a1ac76 with merge base c5b9ee7 (image):

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@yf225 yf225 added the suppress-bc-linter Suppresses the failures of API backward-compatibility linter (Lint/bc_linter) label Nov 19, 2023
@Chillee
Copy link
Contributor

Chillee commented Nov 19, 2023

I don't understand the motivation here - when is this desired?

@yf225
Copy link
Contributor Author

yf225 commented Nov 19, 2023

I don't understand the motivation here - when is this desired?

@Chillee Some internal GenAI teams have a AC autotuner where they make a few trial runs and sort through what modules have low compute/high memory and then make a customized AC plan. They compressed 7 days of manual tuning and got equal or better with the autotuning in a couple hours. (cc. @lessw2020)

I believe allowing user to specify both "what to recompute" (this PR) and "what not to recompute" (already exists) will allow them to have maximum flexibility.

Alternatively we can expose the partitioning heuristics to them and allow them to influence / overwrite it. But it won't be as straightforward to them as asking them to just specify what ops to recompute, and let them reuse their own autotuning infra.

Copy link

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jan 19, 2024
@drisspg
Copy link
Contributor

drisspg commented Jan 23, 2024

I think that this is useful: pytorch-labs/float8_experimental#185

Comment on lines 866 to 872
if must_recompute(node):
# If user explicitly say they want to recompute a node, we honor it
# by adding an 0-capacity edge from the source and an inf edge to the sink.
# This way, X_in node is guaranteed to be part of the subgraph that contains "sink"
# after the cut, thus guaranteeing that X op will be recomputed.
add_edge("source", node.name + "_in", capacity=0)
add_edge(node.name + "_in", "sink", capacity=math.inf)
Copy link
Contributor

@anijain2305 anijain2305 Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mic-cut partitioner is supposed to give better cut on top of user-provided annotation. So, we should maybe not do this.

But if you want to do this deliberately (because FSDP does something funky), a better way might be to just have another config flag that respects the user AC annotation. In that case, you can just check if flag_enabled and must_recompile() and add the two edges, without any other changes anywhere.

Conceptually, its better than having 3 levels - must, ban and prefer. The new experimental flag, I am suggesting, says that partitioner will respect user AC annotation, so its either must or ban recompute.

@github-actions github-actions bot closed this Feb 28, 2024
@yf225 yf225 reopened this Jun 20, 2024
@yf225 yf225 requested review from soulitzer, anijain2305 and Chillee and removed request for Chillee June 20, 2024 23:01
@yf225 yf225 removed suppress-bc-linter Suppresses the failures of API backward-compatibility linter (Lint/bc_linter) Stale labels Jun 20, 2024
@yf225
Copy link
Contributor Author

yf225 commented Jun 20, 2024

I believe there is appetite for adding this API now, so resurrecting this PR.

yf225 added a commit that referenced this pull request Jun 22, 2024
…checkpointing (SAC)

ghstack-source-id: dfc5d8f7d50844b2fa49726ed114a55c014ab89e
Pull Request resolved: #129295
@yf225
Copy link
Contributor Author

yf225 commented Jun 22, 2024

Closing in favor of #129295, to leverage ghstack.

@yf225 yf225 closed this Jun 22, 2024
yf225 added a commit that referenced this pull request Jun 25, 2024
…checkpointing (SAC)

ghstack-source-id: 7a9f0491de59fc37fb98c8b9e6848b6f7524a624
Pull Request resolved: #129295
pytorchmergebot pushed a commit that referenced this pull request Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants