
Conversation

@ahkush (Contributor) commented Sep 23, 2025

Fixes #162129. Added validation in _rank_not_in_group() to check if FakeProcessGroup is properly initialized before use, raising a clear error message if torch.distributed.init_process_group(backend='fake') hasn't been called first.
This prevents silent failures and ensures proper dispatch system integration for all distributed operations.

Added test case test_fake_process_group_direct_usage_error() that validates the error is raised for all_reduce and all_to_all_single operations.

Please let me know if additional distributed operators should be tested or if any other updates are needed.
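
For context, the failure mode being fixed looks roughly like the sketch below: constructing FakeProcessGroup by hand instead of going through init_process_group(backend='fake'). This is an illustrative assumption, not code from the PR; the import path comes from the pybind warning quoted later in this thread, and the constructor arguments are guesses.

```python
import torch
import torch.distributed as dist

# Hypothetical misuse sketch (issue #162129): building the fake group directly.
# Import path taken from the pybind11 class name seen later in this thread;
# positional args (rank, world_size) are an assumption.
from torch._C._distributed_c10d import FakeProcessGroup

pg = FakeProcessGroup(0, 2)   # never registered with the c10d dispatch machinery
t = torch.ones(4)
dist.all_reduce(t, group=pg)  # before this PR: no error, silently incorrect behavior
# After this PR, constructing FakeProcessGroup directly raises an error that
# points users to torch.distributed.init_process_group(backend='fake').
```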

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci

pytorch-bot bot commented Sep 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163665

✅ No Failures

As of commit e5fe478 with merge base bac0f28:
💚 Looks good so far! There are no failures yet. 💚

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Sep 23, 2025
@ezyang (Contributor) commented Sep 24, 2025

The test feels too late to me. Why can't you discover something bad happened earlier?

@jbschlosser jbschlosser requested review from d4l3k, kwen2501 and wconstab and removed request for d4l3k and kwen2501 September 24, 2025 17:07
@jbschlosser jbschlosser added the triaged and topic: bug fixes labels Sep 24, 2025
@ahkush ahkush force-pushed the fake-process-group-direct-usage-error branch from 50cfa4a to ebc113a on September 25, 2025 19:27
@ahkush (Contributor, Author) commented Sep 25, 2025

I moved the validation to the beginning of each distributed operation instead of waiting until _rank_not_in_group. I added it to every operator because:

  1. _group_or_default_group(group) is another common function used in most operators, but in many operators it's called after _rank_not_in_group,
  2. I couldn't add the check to the operators in FakeProcessGroup.hpp, as they're also called later in the function flow, and
  3. there's no other common entry point where I can place a single check that catches all cases early enough, before they hit the dispatch system.

I'd appreciate any suggestions for a better approach!

@ezyang (Contributor) commented Sep 26, 2025

Yeah, this cure is worse than the disease, I think.

What if we blocked direct construction of FakeProcessGroup entirely? Instead, the "official" APIs would have to go through some private API that gets around this blockage.

@ahkush (Contributor, Author) commented Sep 30, 2025

Thanks for pointing me toward the right approach. I've implemented blocking direct construction of FakeProcessGroup entirely:
Changes made:

  • Made the constructor private and added a static _create_internal() method for official APIs
  • Public __init__ now throws a clear error directing users to use torch.distributed.init_process_group(backend='fake')
  • Updated all internal usage to use _create_internal()
  • Added tests covering both the error case and proper dispatch behavior
This ensures users get proper dispatch system integration while maintaining backward compatibility for official APIs. The error message guides users toward the correct usage pattern.

Does this approach look good, or would you like any adjustments to the implementation or error message?
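
To make the description above concrete, here is a minimal pure-Python sketch of the construction-blocking pattern. The actual change lives in the C++/pybind11 binding of FakeProcessGroup, so everything below (class body, attribute names, error text) is an illustrative assumption rather than the real implementation.

```python
# Illustrative-only sketch: block the public constructor and route official
# APIs through a private factory, mirroring the pattern described above.
class _FakePG:  # stand-in for the pybind11-bound FakeProcessGroup
    def __init__(self, *args, **kwargs):
        raise RuntimeError(
            "FakeProcessGroup cannot be constructed directly; "
            "use torch.distributed.init_process_group(backend='fake') instead."
        )

    @classmethod
    def _create_internal(cls, rank, world_size, options=None):
        # Bypass __init__ so only official code paths can build instances.
        self = object.__new__(cls)
        self._rank, self._world_size, self._options = rank, world_size, options
        return self

# Official creator path (analogous to _create_fake_pg in fake_pg.py):
pg = _FakePG._create_internal(0, 2)

# Direct construction now fails loudly instead of silently misbehaving:
# _FakePG(0, 2)  -> RuntimeError pointing at init_process_group(backend='fake')
```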

@kwen2501 (Contributor) left a comment:

I think a solution to the issue filed by @ezyang is for dist.all_reduce to directly call into torch.ops.c10d.all_reduce_, instead of pg.all_reduce.

Comment on lines -24 to 27 (torch/testing/_internal/distributed/fake_pg.py):

```diff
     """
-    return FakeProcessGroup(
+    return FakeProcessGroup._create_internal(
         common_opts.group_rank, common_opts.group_size, backend_opts
     )
```
Contributor:

Looks like a bc break?

Contributor Author:

The _create_fake_pg() function itself has no BC break - same signature, same behavior, still returns a FakeProcessGroup. Only direct FakeProcessGroup() construction breaks (intentionally), which gets a clear error message directing users to the proper API. Internal utilities and type checking continue working unchanged.

@ahkush (Contributor, Author) commented Oct 1, 2025

@kwen2501
While that would ensure dispatch integration, it wouldn't solve the core issue. The dispatch system requires process groups to be registered in the GroupRegistry (via resolve_process_group()), but directly constructed FakeProcessGroup instances are never registered. So torch.ops.c10d.all_reduce_ would fail with "Could not resolve the process group" for direct constructions.

Also, should users be able to construct FakeProcessGroup directly at all, or is it better to guide them toward the official init_process_group(backend='fake') API for proper integration?

Do you have any suggestions for a better approach that would address the dispatch integration issue while handling the registration requirement? I'd appreciate your thoughts on this.
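
For comparison, the supported path looks roughly like the snippet below. The FakeStore helper and the exact init_process_group arguments are assumptions based on the fake_pg.py test utility referenced elsewhere in this thread, not text from this PR.

```python
import torch
import torch.distributed as dist
# Assumption: FakeStore lives alongside _create_fake_pg in the fake_pg test helper.
from torch.testing._internal.distributed.fake_pg import FakeStore

# Official path: init_process_group registers the fake group with c10d, so
# collectives can be resolved by the dispatch machinery.
dist.init_process_group(backend="fake", store=FakeStore(), rank=0, world_size=2)

t = torch.ones(4)
dist.all_reduce(t)  # routed through the registered fake process group

dist.destroy_process_group()
```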

```diff
 self.process_group = process_group
 if self._use_fake_all_gather or self._use_fake_reduce:
-    self._fake_process_group = FakeProcessGroup(
+    self._fake_process_group = FakeProcessGroup._create_internal(
```
Contributor:

I'm actually kind of skeptical about this use site, it sort of feels like potentially this is buggy LOL

```diff
-      c10::intrusive_ptr<Options> options = c10::make_intrusive<Options>())
-      : Backend(rank, size), options_(std::move(options)) {}
+      c10::intrusive_ptr<Options> options = c10::make_intrusive<Options>()) {
+    return c10::intrusive_ptr<FakeProcessGroup>(
```
Contributor:

nit: make_intrusive_ptr

```cpp
 private:
  // Private constructor used by official APIs
  FakeProcessGroup(int rank, int size, c10::intrusive_ptr<Options> options)
      : Backend(rank, size), options_(std::move(options)) {}
```
Contributor:

I don't think it is as important to hide the ctor on the C++ side

@ezyang (Contributor) commented Oct 2, 2025

This looks good. Unfortunately you need to rebase

@ahkush ahkush force-pushed the fake-process-group-direct-usage-error branch from 5606d73 to e5fe478 on October 2, 2025 15:03
@ezyang (Contributor) commented Oct 2, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Oct 2, 2025
@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

pytorchmergebot pushed a commit that referenced this pull request Oct 15, 2025

These failures happen when building with CMAKE_BUILD_TYPE=RelWithAssert.

This should fix two types of failures that started with #163665

Disclaimer that I used a lot of AI since I don't know how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now).

The first type is

Truncated:
```
    default_pg, _ = _new_process_group_helper(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper
    backend_class = creator_fn(dist_backend_opts, backend_options)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg
    return FakeProcessGroup._create_internal(
RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero.
Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
#7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0
#8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0
#9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0
#10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
```
and I fix it here by getting rid of `DontIncreaseRefcount` and using make_intrusive to do the refcount handling instead. However, I also had to move the constructor to be public, which I think is not good, based on the reasoning of the original PR.

The other type is
```
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import
    self.assertEqual(out, "")
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != ''
- /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode.
-   if is_available() and not torch._C._c10d_init():

To execute this test, run the following from the base repo dir:
    python test/test_testing.py TestImports.test_no_warning_on_import
```
which I fix by getting rid of the `__init__`, which I think is ok since it'll just error if you try to make one?

Pull Request resolved: #165479
Approved by: https://github.com/ezyang

Labels

ciflow/trunk, Merged, oncall: distributed, open source, release notes: distributed (c10d), topic: bug fixes, triaged


Development

Successfully merging this pull request may close these issues.

Direct construction of FakeProcessGroup doesn't raise errors but has silent incorrectness
