distributed training support #258

Closed
yonigottesman opened this issue May 23, 2022 · 6 comments · Fixed by #262
Labels
component:losses Issues related to support additional metric learning techniques (e.g. loss) type:bug Something isn't working

Comments

@yonigottesman
Contributor

When using distributed training the simclr loss should be calculated on all samples across gpus like here:
https://github.com/google-research/simclr/blob/master/objective.py
Is it planned to add this functionality?

@owenvallis
Collaborator

The SimCLR loss subclasses keras.losses.Loss, so it should support summing a distributed loss through the Keras Reduction setting and tf.distribute.Strategy.

Let me know if you are running into any specific issues though.

@yonigottesman
Contributor Author

Hi @owenvallis
I'm not sure that's correct :). I think the reduction parameter only controls how the computed loss is averaged across GPUs, after the loss has already been calculated on each replica.
In SimCLR the reason for a large batch size is to get a bigger set of "negatives" for each positive example: each example is compared against the 2N-1 other examples, and they show that the bigger the N, the better the performance.
If you split the large batch across multiple GPUs, each GPU computes the contrastive loss only on its portion of the large batch,
resulting in perhaps faster training than a single GPU, but a smaller effective "SimCLR" batch. To compare each example against the global 2N-1 examples across GPUs, the original implementation gathers all the other "hidden" embeddings from all GPUs here https://github.com/google-research/simclr/blob/master/tf2/objective.py#L60 and only then computes the loss.
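The role of the 2N-1 comparisons can be sketched in plain NumPy; `global_nt_xent` below is a hypothetical, simplified stand-in for the SimCLR NT-Xent loss over one global batch, not the actual library code:

```python
import numpy as np

def global_nt_xent(za, zb, temperature=0.5):
    """Sketch of NT-Xent over a global batch of N pairs.

    za, zb: L2-normalized embeddings of the two augmented views,
    shape (N, dim). All 2N embeddings are stacked, so each one is
    contrasted against the other 2N - 1; sharding N across GPUs
    without gathering shrinks that candidate set.
    """
    n = za.shape[0]
    z = np.concatenate([za, zb], axis=0)        # (2N, dim)
    sim = (z @ z.T) / temperature               # pairwise similarities
    np.fill_diagonal(sim, -np.inf)              # drop self-similarity
    pos = (np.arange(2 * n) + n) % (2 * n)      # index of the other view
    # Cross-entropy of the positive among the remaining 2N - 1 logits.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))
```

With identical views the positives sit at the maximum similarity, so the loss is lower than for unrelated views, which is a quick sanity check for the sketch.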

What do you think?

@owenvallis owenvallis reopened this May 30, 2022
@owenvallis
Collaborator

Thanks for clarifying, looks like you are correct here. I'll need to add support for the tpu_cross_replica_concat() function. I've opened this as a bug.

Let me know if you'd like to pick this up, otherwise I can try and get to it soonish but it might be a second before I have the time.
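A minimal sketch of what such a gather could look like, assuming `tf.distribute.ReplicaContext.all_gather` (available in TF 2.4+); `cross_replica_concat` is a hypothetical name modeled on SimCLR's tpu_cross_replica_concat(), not the actual Similarity API:

```python
import tensorflow as tf

def cross_replica_concat(tensor):
    """Gather a per-replica tensor across all replicas (sketch).

    Inside a tf.distribute strategy, each replica receives the
    concatenation of every replica's slice along axis 0, so the
    contrastive loss can see the global batch of embeddings.
    """
    ctx = tf.distribute.get_replica_context()
    if ctx is None or ctx.num_replicas_in_sync == 1:
        # Single replica (or cross-replica context): nothing to gather.
        return tensor
    # all_gather concatenates the per-replica values along axis 0.
    return ctx.all_gather(tensor, axis=0)
```

Outside a strategy (or with one replica) this is a no-op, which keeps single-GPU behavior unchanged.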

@owenvallis owenvallis added type:bug Something isn't working component:losses Issues related to support additional metric learning techniques (e.g. loss) labels Jun 3, 2022
@owenvallis owenvallis modified the milestone: 0.16 Jun 3, 2022
@yonigottesman
Contributor Author

Yeah I would love to pick this one up and give it a go. Will open a pr when ready 😅

@yonigottesman
Contributor Author

Hi, there were some more issues with the implementation:

  • the loss is supposed to run twice: (za, zb) and (zb, za)
  • the loss is multiplied by 0.5 for no apparent reason, probably confusion with the Reduction
  • a strange "margin" was added; it is not in the original implementation. What is it?
  • and of course the distributed issue :)

Opened PR #262.

@owenvallis
Collaborator

Hi @yonigottesman,

Thanks for looking into this. So we do actually run both (za,zb) and (zb,za), it's just handled in the forward pass function in the contrastive model class. The multiply by 0.5 in the loss is just to scale the final summed loss back to an expected range for cosine distance, and the margin is just a small epsilon value to avoid 0 gradients.
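The symmetric two-pass plus 0.5 scaling described above can be sketched as follows; `directional_nt_xent` and `symmetric_nt_xent` are hypothetical simplified stand-ins (the margin epsilon is omitted):

```python
import numpy as np

def directional_nt_xent(anchors, targets, temperature=0.5):
    # One directional pass: each anchor's positive is the matching
    # row of `targets`, and the softmax runs over all target rows,
    # so swapping the arguments gives a different value in general.
    logits = (anchors @ targets.T) / temperature
    logsumexp = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(logits)))

def symmetric_nt_xent(za, zb):
    # Both (za, zb) and (zb, za) passes are summed; the 0.5 scales
    # the sum back to the range of a single directional pass.
    return 0.5 * (directional_nt_xent(za, zb) + directional_nt_xent(zb, za))
```

By construction the symmetric version is invariant to swapping the two views, even though each directional pass is not.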

I don't think the scaling or the margin value should impact the loss performance during training, but let me know if you see it causing any specific issues.

I also saw your test case in the pull request. I'll try and run some tests to compare our output to the original implementation, and I'll remove the margin and scaling for testing to make sure.
