distributed training support #258

Closed
yonigottesman opened this issue May 23, 2022 · 6 comments · Fixed by #262
Labels
component:losses Issues related to support additional metric learning techniques (e.g. loss) type:bug Something isn't working

Comments

@yonigottesman
Contributor

When using distributed training the simclr loss should be calculated on all samples across gpus like here:
https://github.com/google-research/simclr/blob/master/objective.py
Is it planned to add this functionality?

@owenvallis
Collaborator

The SimCLR loss subclasses keras.losses.Loss, so it should support summing a distributed loss through the Keras Reduction setting and tf.distribute.Strategy.

Let me know if you are running into any specific issues though.

@yonigottesman
Contributor Author

Hi @owenvallis
I'm not sure that's correct :). I think the reduction parameter only controls how the computed loss is averaged across GPUs, after the loss has already been calculated on each replica.
In SimCLR the reason for a large batch size is to get a bigger set of "negatives" for each positive example: each example is compared against the 2N-1 other examples, and they show that the bigger the N, the better the performance.
If you split the large batch across multiple GPUs, each GPU computes the contrastive loss only on its portion of the large batch,
resulting in perhaps faster training than a single GPU, but a smaller effective "SimCLR" batch. To compare each example against the global 2N-1 examples across GPUs, the original implementation gathers all the other "hidden" embeddings from all GPUs here https://github.com/google-research/simclr/blob/master/tf2/objective.py#L60 and only then computes the loss.
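The role of the 2N-1 comparisons can be sketched in plain NumPy; `global_nt_xent` below is a hypothetical, simplified stand-in for the SimCLR NT-Xent loss over one global batch, not the actual library code:

```python
import numpy as np

def global_nt_xent(za, zb, temperature=0.5):
    """Sketch of NT-Xent over a global batch of N pairs.

    za, zb: L2-normalized embeddings of the two augmented views,
    shape (N, dim). All 2N embeddings are stacked, so each one is
    contrasted against the other 2N - 1; sharding N across GPUs
    without gathering shrinks that candidate set.
    """
    n = za.shape[0]
    z = np.concatenate([za, zb], axis=0)        # (2N, dim)
    sim = (z @ z.T) / temperature               # pairwise similarities
    np.fill_diagonal(sim, -np.inf)              # drop self-similarity
    pos = (np.arange(2 * n) + n) % (2 * n)      # index of the other view
    # Cross-entropy of the positive among the remaining 2N - 1 logits.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))
```

With identical views the positives sit at the maximum similarity, so the loss is lower than for unrelated views, which is a quick sanity check for the sketch.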

What do you think?

@owenvallis owenvallis reopened this May 30, 2022
@owenvallis
Collaborator

Thanks for clarifying, looks like you are correct here. I'll need to add support for the tpu_cross_replica_concat() function. I've opened this as a bug.

Let me know if you'd like to pick this up, otherwise I can try and get to it soonish but it might be a second before I have the time.
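A minimal sketch of what such a gather could look like, assuming `tf.distribute.ReplicaContext.all_gather` (available in TF 2.4+); `cross_replica_concat` is a hypothetical name modeled on SimCLR's tpu_cross_replica_concat(), not the actual Similarity API:

```python
import tensorflow as tf

def cross_replica_concat(tensor):
    """Gather a per-replica tensor across all replicas (sketch).

    Inside a tf.distribute strategy, each replica receives the
    concatenation of every replica's slice along axis 0, so the
    contrastive loss can see the global batch of embeddings.
    """
    ctx = tf.distribute.get_replica_context()
    if ctx is None or ctx.num_replicas_in_sync == 1:
        # Single replica (or cross-replica context): nothing to gather.
        return tensor
    # all_gather concatenates the per-replica values along axis 0.
    return ctx.all_gather(tensor, axis=0)
```

Outside a strategy (or with one replica) this is a no-op, which keeps single-GPU behavior unchanged.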

@owenvallis owenvallis added type:bug Something isn't working component:losses Issues related to support additional metric learning techniques (e.g. loss) labels Jun 3, 2022
@owenvallis owenvallis modified the milestone: 0.16 Jun 3, 2022
@yonigottesman
Contributor Author

Yeah I would love to pick this one up and give it a go. Will open a pr when ready 😅

@yonigottesman
Contributor Author

Hi, there were some more issues with the implementation:

  • the loss is supposed to run twice: (za, zb) and (zb, za)
  • the loss is multiplied by 0.5 for no apparent reason, probably confusion with the Reduction
  • a strange "margin" was added; it is not in the original implementation. What is it?
  • and of course the distributed issue :)

Opened PR #262.

@owenvallis
Collaborator

Hi @yonigottesman,

Thanks for looking into this. So we do actually run both (za,zb) and (zb,za), it's just handled in the forward pass function in the contrastive model class. The multiply by 0.5 in the loss is just to scale the final summed loss back to an expected range for cosine distance, and the margin is just a small epsilon value to avoid 0 gradients.
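The symmetric two-pass plus 0.5 scaling described above can be sketched as follows; `directional_nt_xent` and `symmetric_nt_xent` are hypothetical simplified stand-ins (the margin epsilon is omitted):

```python
import numpy as np

def directional_nt_xent(anchors, targets, temperature=0.5):
    # One directional pass: each anchor's positive is the matching
    # row of `targets`, and the softmax runs over all target rows,
    # so swapping the arguments gives a different value in general.
    logits = (anchors @ targets.T) / temperature
    logsumexp = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(logits)))

def symmetric_nt_xent(za, zb):
    # Both (za, zb) and (zb, za) passes are summed; the 0.5 scales
    # the sum back to the range of a single directional pass.
    return 0.5 * (directional_nt_xent(za, zb) + directional_nt_xent(zb, za))
```

By construction the symmetric version is invariant to swapping the two views, even though each directional pass is not.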

I don't think the scaling or the margin value should impact the loss performance during training, but let me know if you see it causing any specific issues.

I also saw your test case in the pull request. I'll try and run some tests to compare our output to the original implementation, and I'll remove the margin and scaling for testing to make sure.
