
Conversation

@pytorch-bot

pytorch-bot bot commented Jun 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155070

Note: Links to docs will display an error until the docs builds have been completed.

❌ 14 New Failures, 1 Unrelated Failure

As of commit 68feaac with merge base 065c446:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@albanD albanD removed their request for review June 4, 2025 00:09
@janeyx99 janeyx99 removed their request for review June 4, 2025 20:11
@qingyi-yan qingyi-yan force-pushed the main branch 2 times, most recently from d9dd5d1 to 3453f31 on June 5, 2025 18:48
@mikaylagawarecki mikaylagawarecki added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jun 6, 2025
@mikaylagawarecki mikaylagawarecki requested a review from yf225 June 6, 2025 22:03
@RabbitWhite1
Contributor

@qingyi-yan Hi, this is great work! May I kindly ask whether isend and irecv will be supported?

@qingyi-yan
Author

> @qingyi-yan Hi, this is great work! May I kindly ask whether isend and irecv will be supported?

Thanks @RabbitWhite1 for the compliment. Right now I am focusing on getting this pull request merged. Supporting isend and irecv is certainly doable, but it depends on the availability of my resources for this work, which is currently uncertain.

@qingyi-yan
Author

Hi, just checking: I believe this pull request is ready for review and possibly merging. It has been a long time since I made my last update. Is there anything I need to do? Thanks.

Contributor

@wconstab wconstab left a comment


This is an interesting case because, by definition, if we compile a send or recv op, our graph is non-SPMD. We have been designing how compiler optimizations should behave for distributed programs, and making sure that the resulting program is still valid is very difficult unless we assume/enforce that it is SPMD (the same graph on every rank).

I think we should support capturing send/recv, but we should have some rules. If we capture a p2p op, we need to also make sure we are not doing unsafe collective optimizations, for example. For now I think all we need to do is raise an error if any of the spmd-mode flags or compiler passes are registered and we encounter a p2p op during tracing. What do folks think?

@bdhirsh would be the best person to advise on this issue though he is out for a week or two. Also cc @xmfan @ezyang
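
A minimal sketch of the kind of guard proposed above, assuming a hypothetical spmd_mode flag in torch._dynamo.config (the real flag name and the exact hook point are not settled in this thread):

import torch._dynamo.config as dynamo_config

# Fully qualified names of the point-to-point ops that would be captured.
_P2P_OPS = {
    "torch.distributed.distributed_c10d.send",
    "torch.distributed.distributed_c10d.recv",
}


def check_p2p_capture_allowed(qualified_name: str) -> None:
    # Raise during tracing if a p2p op is captured while SPMD-only compiler
    # passes are active. `spmd_mode` is an assumed flag, not a real config today.
    if qualified_name not in _P2P_OPS:
        return
    if getattr(dynamo_config, "spmd_mode", False):
        raise RuntimeError(
            f"Capturing {qualified_name} is unsafe while SPMD-mode compiler "
            "passes are enabled: the traced graph would differ across ranks."
        )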

@ezyang
Contributor

ezyang commented Aug 3, 2025

I mean we just have to actually implement spmd mode IMO.

@qingyi-yan
Author

Agreed with @wconstab that more consistency checking would make support for p2p ops safer. I am waiting for more detailed feedback on the relevant rules (checks) that are needed.

@ezyang
Contributor

ezyang commented Sep 2, 2025

I think this is basically reasonable. To address wconstab's concern, I suggest we only enable this if a config flag is set, and set it to False by default. I haven't reviewed the rest of the PR carefully, but if you're willing to add the config flag I'll do the rest of the review.
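
For reference, a sketch of what such a flag might look like, assuming it lives in torch/_dynamo/config.py; the name capture_distributed_p2p_ops is a placeholder, not a settled spelling:

# torch/_dynamo/config.py (sketch; the flag name is a placeholder)
# When False (the default), Dynamo keeps today's behavior and graph-breaks on
# torch.distributed send/recv instead of capturing them into the graph.
capture_distributed_p2p_ops = False

The capture path would then consult the flag, e.g. if not config.capture_distributed_p2p_ops, fall back to a graph break as before.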

@qingyi-yan
Author

Yes, I agree with adding a config flag that is set to False by default. I will work on this and hope to have it ready in a week or so. Thanks for the feedback!

"torch.sparse_compressed_tensor": SkipFunctionVariable,
# Specially handle system-level communication functions
"torch.distributed.distributed_c10d.send": CommunicationFunctionVariable,
"torch.distributed.distributed_c10d.recv": CommunicationFunctionVariable,
Contributor


Help me understand why these aren't handled the same way as other traceable collectives?

Author


The recv function has an API that does not fit the functional paradigm: specifically, the recv(variable) interface modifies the given variable in place. Another concern is that system-level communications may modify the underlying system state, which means they may have unknown side effects. So I assumed it is not safe to treat them as functional collectives.
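
To illustrate the in-place semantics described above (this snippet assumes an already initialized process group and only shows the shape of the existing torch.distributed API):

import torch
import torch.distributed as dist

# dist.recv fills the provided tensor in place and returns the sender's rank,
# so there is no purely functional output to thread through a traced graph.
buf = torch.empty(4)
src_rank = dist.recv(buf, src=0)  # mutates `buf`; the return value is just an int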

use_fallback = False

import torch.distributed.distributed_c10d as c10d
# Fall back to disabling autograd if mutation has to be supported.
Contributor


Do the tests you added fail due to this? It feels like this is just a DCE problem? It should work to run AOTAutograd here.

Author


The issue was mutation of the input parameter in the recv function. Because this mutation violates the functional assumption, the parameter fails to actually be modified when autograd is enabled, due to the use of functionalized objects. I am not aware of a way around it, other than adopting an alternative functional API for the recv function.
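
One possible shape for an out-of-place alternative, sketched under the assumption that the receiver knows the shape and dtype ahead of time (and, like recv itself, that a process group is initialized); functional_recv is a hypothetical helper, not an existing API:

import torch
import torch.distributed as dist


def functional_recv(shape, dtype, src, group=None, tag=0):
    # Allocate the output inside the op and return it instead of mutating a
    # caller-provided buffer, which is the pattern functionalization expects.
    out = torch.empty(shape, dtype=dtype)
    dist.recv(out, src=src, group=group, tag=tag)
    return out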

fn = fn_var.fn
return variables.TorchInGraphFunctionVariable(fn, nonstrict_traceable=True)
name = self.fn.__name__
print (f"name={name}")
Contributor


don't forget to remove

Author


Definitely. Thanks for catching my carelessness!

tx.output.create_proxy("call_function", self.fn,
*proxy_args_kwargs(args, kwargs))
return variables.ConstantVariable(None)
return super().call_function(tx, args, kwargs)
Contributor


This is probably not quite right. I think it would be better to use something analogous to traceable collectives remapping to support this. See this:

def _traceable_collective_remaps():
    # We can't rely on importing from distributed, since it's not always built
    if torch.distributed.is_available():
        from torch.distributed._functional_collectives import (
            traceable_collective_remaps,
        )

        return traceable_collective_remaps
    return {}


def _traceable_collectives_source(tx: "InstructionTranslator", fn):
    assert torch.distributed.is_available(), "Illegal invocation."
    assert fn in _traceable_collective_remaps().values()

    inner_name = fn.__name__
    path_source = tx.import_source("torch.distributed._functional_collectives")
    return AttrSource(path_source, inner_name)

Essentially, we need functional versions of send and recv. Then you can use the CollectiveFunctionRewriteVariable apparatus to get to the functional collective.
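
For illustration, the remap entries might end up looking roughly like this, assuming functional send/recv wrappers were added to torch.distributed._functional_collectives (they do not exist there today, so funcol.send and funcol.recv are placeholders):

import torch.distributed.distributed_c10d as c10d
import torch.distributed._functional_collectives as funcol

# Hypothetical additions to the existing remap table; funcol.send / funcol.recv
# are assumed functional wrappers, not current APIs.
funcol.traceable_collective_remaps.update({
    c10d.send: funcol.send,
    c10d.recv: funcol.recv,
})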

@qingyi-yan
Author

Thank you @ezyang for the helpful feedback! I will try what you suggested and get back to you.

@github-actions
Contributor

github-actions bot commented Nov 9, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.


Labels

ciflow/inductor, fx, module: compiled autograd, module: cpu, module: dynamo, module: inductor, module: rocm, oncall: distributed, oncall: jit, open source, release notes: distributed (checkpoint), release notes: inductor (aoti), release notes: quantization, release notes: releng, Stale, triaged


Development

Successfully merging this pull request may close these issues.

supporting dynamo compilation of end-to-end send/recv in distributed

7 participants