
Fix zero-1 bug for inferring local ranks #5936

Merged
merged 2 commits into pytorch:master on Dec 14, 2023

Conversation

YangFei1990
Contributor

Fix a bug in how the zero-1 optimizer infers local ranks. Before this PR, zero-1 computed self.local_rank = self.global_rank // len(self.sharding_groups), which can produce wrong results when the distribution strategy is more complicated. For example, with PP=4 and DP=8 on a single node, the DP groups are [[0-7], [8-15], [16-23], [24-31]]. With that formula, different ranks in the same DP group can end up with the same local rank (e.g. global ranks 0-3 all map to local rank 0), instead of each rank getting its distinct position within its group.

The fix is to use the existing sharding groups to infer the local rank, i.e. find the index of the current rank within the sharding group that contains it.
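
A minimal sketch of the fixed inference (illustrative only; names are placeholders, not the exact torch_xla implementation):

```python
def infer_local_rank(global_rank, sharding_groups):
    """Return the rank's index within the sharding group that contains it."""
    for group in sharding_groups:
        if global_rank in group:
            return group.index(global_rank)
    raise ValueError(f"global rank {global_rank} not found in any sharding group")
```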

@alanwaketan
Collaborator

@YangFei1990 Can you provide some test cases to illustrate the problem? And how does the fix address the issue? Sorry, but I don't quite follow your description.

@YangFei1990
Contributor Author

Hi @alanwaketan, for the example in the description, say the user passes the groups as [[0-7], [8-15], [16-23], [24-31]], i.e. opt = ZeroRedundancyOptimizer(..., sharding_group=[[0-7], [8-15], [16-23], [24-31]], ...). With the existing local-rank inference self.local_rank = self.global_rank // len(self.sharding_groups), global ranks 0-7 are mapped to local ranks 0, 0, 0, 0, 1, 1, 1, 1 (since there are 4 groups). However, the local rank should be each rank's position within the group it belongs to, so global ranks 0-7 should get local ranks 0-7. This PR addresses that by inferring the local rank from the index of the global rank within its sharding group.
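
A quick standalone check of the numbers in this example (an illustrative sketch, not the actual optimizer code):

```python
# DP groups for PP=4, DP=8 on 32 ranks: [[0-7], [8-15], [16-23], [24-31]]
sharding_groups = [list(range(i * 8, (i + 1) * 8)) for i in range(4)]

for global_rank in range(8):  # ranks in the first DP group
    old = global_rank // len(sharding_groups)              # old formula: 0,0,0,0,1,1,1,1
    group = next(g for g in sharding_groups if global_rank in g)
    new = group.index(global_rank)                          # fixed: 0,1,2,...,7
    print(global_rank, old, new)
```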
If you'd like a test added for this, could you share some guidance on how I could add it?

@alanwaketan
Collaborator

That makes a lot of sense. I guess it's probably hard to make a test case for it unless you do a lot of mocking. I will just approve it.

@alanwaketan alanwaketan merged commit 5577dd7 into pytorch:master Dec 14, 2023
17 checks passed
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
ManfeiBai pushed a commit to ManfeiBai/PyTorchXLA that referenced this pull request Dec 15, 2023
ManfeiBai pushed a commit to ManfeiBai/PyTorchXLA that referenced this pull request Dec 26, 2023
ManfeiBai added a commit that referenced this pull request Dec 27, 2023
Co-authored-by: Fei <33940270+YangFei1990@users.noreply.github.com>