
Conversation

@KickItLikeShika (Contributor) commented Mar 18, 2021

Fixes #1813

Extends the sync_all_reduce API, adding an op argument to allow more reduction operations (SUM, PRODUCT, MAX, MIN)

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)
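The extended decorator syntax could be sketched as follows. This is a minimal pure-Python illustration of the "var:OP" attribute-spec idea, not ignite's actual internals; parse_attr_spec and _VALID_OPS are hypothetical names.

```python
# Hedged sketch: how a "var:OP" attribute spec might be split into
# (name, op). Assumes plain "var" keeps the historical SUM behaviour.
_VALID_OPS = {"SUM", "PRODUCT", "MAX", "MIN"}

def parse_attr_spec(spec: str):
    """Split an attribute spec like 'var:MAX' into ('var', 'MAX')."""
    if ":" not in spec:
        return spec, "SUM"  # no explicit op: default to SUM
    name, _, op = spec.partition(":")
    if not name or op not in _VALID_OPS:
        raise ValueError(f"Invalid attribute specification: {spec!r}")
    return name, op

# A metric method could then be decorated roughly like:
#   @sync_all_reduce("_num_examples", "_prod:PRODUCT", "_max:MAX")
#   def compute(self): ...
```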

@github-actions bot added the module: metrics label Mar 18, 2021
@KickItLikeShika (Contributor, Author)
Already tested these updates on #1813 and they work fine

@sdesrozis (Contributor)

Thank you @KickItLikeShika, it looks good. However, unit tests for this improvement are needed:

  • test all the combinations of op
  • catch the error if a wrong op is given (for instance "var:fakeop", "var:", etc.)

@KickItLikeShika (Contributor, Author) commented Mar 20, 2021

I will work on tests once we agree on the implementation.

  • test all the combination of op

That's already done in torch.distributed, but I can do that in sync_all_reduce(). What do you think?

  • catch the error if a wrong op (for instance "var:fakeop", "var:", etc.)
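The two review requests above could be covered by a test along these lines. This is a hedged, framework-free sketch: parse_attr_spec is a hypothetical stand-in for the decorator's real parsing step, not ignite code.

```python
# Hedged sketch of the requested unit tests: every valid op combination
# is accepted, and malformed specs ("var:fakeop", "var:") raise.
_VALID_OPS = {"SUM", "PRODUCT", "MAX", "MIN"}

def parse_attr_spec(spec: str):
    # Hypothetical stand-in for the decorator's parsing of "var:OP".
    if ":" not in spec:
        return spec, "SUM"
    name, _, op = spec.partition(":")
    if not name or op not in _VALID_OPS:
        raise ValueError(f"Invalid attribute specification: {spec!r}")
    return name, op

# Every supported op round-trips.
for op in _VALID_OPS:
    assert parse_attr_spec(f"var:{op}") == ("var", op)

# Malformed specs must be rejected, as requested in the review.
for bad in ("var:fakeop", "var:", ":SUM"):
    try:
        parse_attr_spec(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"{bad!r} should have been rejected")
```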

@sdesrozis (Contributor) commented Mar 20, 2021

Thank you!

That's already done in torch.distributed, but I can do that in sync_all_reduce(). What do you think?

Tests to update are here:

def _test_distrib_sync_all_reduce_decorator(device):

We must not forget to update the documentation too.

@vfdev-5 (Collaborator) left a comment:

Thanks for the PR @KickItLikeShika! A question about the implementation.

@KickItLikeShika requested a review from vfdev-5 on March 21, 2021 15:14
@vfdev-5 (Collaborator) left a comment:

LGTM!
@sdesrozis your thoughts and your turn to review :)

@sdesrozis (Contributor) left a comment:

Thank you @KickItLikeShika! LGTM

@KickItLikeShika (Contributor, Author)

@vfdev-5 @sdesrozis please check the CI failures. We get an assertion error here: assert (self.prod.cpu() == (self.prod_nocomp * 5) * (self.prod_nocomp * 5)).all(). But it works fine locally!
[Screenshot of the CI assertion failure, 2021-03-21]

@vfdev-5 (Collaborator) commented Mar 21, 2021

@KickItLikeShika do you mean the XLA tests? Have you tested it with XLA CPU locally?

@KickItLikeShika (Contributor, Author) commented Mar 21, 2021

@vfdev-5 I realized the error: in the prod test I only considered world_size = 2. I will fix that, thanks!
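The arithmetic behind that fix can be illustrated in plain Python. allreduce_product below is a hypothetical helper for illustration only, not ignite code: with op=PRODUCT, an all-reduce multiplies one contribution per process, so an expectation hard-coded for two workers (v * v) breaks on larger pools.

```python
# Illustration: an all-reduce with op=PRODUCT over world_size processes,
# each contributing value v, yields v ** world_size, not v * v.
def allreduce_product(values):
    result = 1
    for v in values:
        result *= v
    return result

v = 5
assert allreduce_product([v] * 2) == v * v   # only holds for 2 workers
assert allreduce_product([v] * 4) == v ** 4  # world-size-generic form
```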

@KickItLikeShika force-pushed the extend-sync-all-reduce branch from e3a607a to 242ecea on March 21, 2021 16:54
@KickItLikeShika force-pushed the extend-sync-all-reduce branch from a9516dc to 4fb6a68 on March 22, 2021 15:59
@vfdev-5 merged commit 341fa93 into pytorch:master Mar 22, 2021
@KickItLikeShika deleted the extend-sync-all-reduce branch March 22, 2021 16:58
Merging this pull request closes: Support for max/min ops for sync_all_reduce (#1813)