@datumbox datumbox commented Feb 24, 2022

Currently, only the detection recipe calls torch.distributed.barrier() right after initializing the process group:

    torch.distributed.init_process_group(
        backend=args.dist_backend, init_method=args.dist_url, world_size=args.world_size, rank=args.rank
    )
    torch.distributed.barrier()

This PR adds the same barrier to all other recipes.

@mrshenli clarified offline that, in theory, the extra barrier should not be needed, because init_process_group already performs a store-based barrier (see here), and dist.barrier is essentially an empty allreduce. To be on the safe side, we opted to add it to all references anyway.
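
For illustration, here is a minimal sketch of the init-then-barrier pattern as a helper function. The names `init_distributed_mode` and `RecordingDist`, the optional `dist` parameter, and the example argument values are assumptions made for this sketch and are not taken from the reference scripts. The `dist` parameter exists only so the call order can be exercised without launching real processes. When it is omitted, the helper falls back to `torch.distributed`.

```python
import argparse


def init_distributed_mode(args, dist=None):
    """Initialize the default process group, then synchronize all ranks.

    `dist` is a hypothetical injection point for this sketch; it defaults
    to torch.distributed when not provided.
    """
    if dist is None:
        import torch.distributed as dist  # imported lazily; requires PyTorch

    dist.init_process_group(
        backend=args.dist_backend,
        init_method=args.dist_url,
        world_size=args.world_size,
        rank=args.rank,
    )
    # In theory redundant, since init_process_group already performs a
    # store-based barrier. Kept as an explicit safety net.
    dist.barrier()


class RecordingDist:
    """Stand-in for torch.distributed that only records which calls were made."""

    def __init__(self):
        self.calls = []

    def init_process_group(self, **kwargs):
        self.calls.append(("init_process_group", kwargs))

    def barrier(self):
        self.calls.append(("barrier", {}))


# Assumed values, chosen only for illustration.
args = argparse.Namespace(
    dist_backend="gloo",
    dist_url="env://",
    world_size=1,
    rank=0,
)
fake = RecordingDist()
init_distributed_mode(args, dist=fake)
print([name for name, _ in fake.calls])
```

In a real recipe the helper would be called without the `dist` argument, once per process, before any model or data loader is built.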

facebook-github-bot commented Feb 24, 2022

💊 CI failures summary and remediations

As of commit de22c07 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

1 failure not recognized by patterns:

Job: CircleCI cmake_macos_cpu
Step:
    curl -o conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
    sh conda.sh -b
    source $HOME/miniconda3/bin/activate
    conda install -yq conda-build cmake
    packaging/build_cmake.sh

This comment was automatically generated by Dr. CI.

@datumbox datumbox requested a review from mrshenli February 24, 2022 18:46
@mrshenli mrshenli left a comment

LGTM

@datumbox datumbox merged commit e92f119 into pytorch:main Feb 24, 2022
@datumbox datumbox deleted the references/distr_barrier_after_init branch February 24, 2022 19:12
facebook-github-bot pushed a commit that referenced this pull request Feb 25, 2022
Reviewed By: jdsgomes

Differential Revision: D34475320

fbshipit-source-id: 3d14fc76081e405c8e4e29645363edbecc5d8f5a