
[Feat]support horovod sync train #205

Merged

merged 1 commit into tensorflow:master from horovod_sync_train on Feb 16, 2022

Conversation

Contributor

@a6802739 commented Dec 30, 2021

  • Use Horovod to synchronize the dense gradients of the model parameters; for TrainableWrapper's gradients, keep the same behavior as before (see the sketch below).
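
A minimal sketch of that split, assuming horovod.tensorflow is imported as hvd and that TrainableWrapper is the dynamic-embedding trainable shell from tensorflow_recommenders_addons (the exact import path is an assumption):

import horovod.tensorflow as hvd
from tensorflow_recommenders_addons import dynamic_embedding as de

def aggregate_gradients(grads_and_vars):
  # Allreduce only the dense gradients; gradients that belong to a
  # TrainableWrapper are passed through unchanged, matching the previous
  # behavior for dynamic-embedding parameters.
  aggregated = []
  for grad, var in grads_and_vars:
    if grad is None or isinstance(var, de.TrainableWrapper):  # TrainableWrapper path is an assumption
      aggregated.append((grad, var))  # pass-through
    else:
      aggregated.append((hvd.allreduce(grad, op=hvd.Sum), var))
  return aggregated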

@rhdong rhdong changed the title support horovod sync train [feat]support horovod sync train Dec 30, 2021
@rhdong rhdong changed the title [feat]support horovod sync train [Feat]support horovod sync train Dec 30, 2021
Member

rhdong commented Dec 30, 2021

Hi @a6802739 , thank you for your contribution!

@a6802739 a6802739 force-pushed the horovod_sync_train branch 5 times, most recently from da36228 to 7475fe3 Compare December 30, 2021 13:43
    aggregated_grad.append(None)  # pass-through.
    continue
  elif isinstance(grad, ops.Tensor):
    aggregated_grad.append(hvd.allreduce(grad, op=hvd.Sum))
Member

And just for discussion: I noticed that Horovod implements lots of features and optimizations in its DistributedOptimizer, like tensor fusion, grouped allreduce, Adasum, gradient compression, etc., in recent versions. Using the hvd.allreduce API directly may not reuse these features.
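
For reference, those optimizations are normally picked up by wrapping the optimizer rather than calling hvd.allreduce per tensor; a minimal Keras-API sketch (the parameter choices are only illustrative):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
# DistributedOptimizer is where tensor fusion, grouped allreduce,
# gradient compression, and Adasum are wired in.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)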

Contributor Author

Yeah. Tensor fusion is controlled by an environment variable and is turned on by default, and for most recommendation workloads I think there is no need to care about grouped allreduce or gradient compression. But we could find a better way to expose the reduction operation to the user. If we let the user specify hvd.Sum or hvd.Adasum directly, they would have to import horovod before calling apply_gradients. Alternatively, we could let the user pass the reduction method as a string like "sum" and map it to the corresponding hvd op, but then if Horovod adds a new reduction op we would have to change the code to stay compatible with the newest Horovod version.
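
A sketch of the string-to-op mapping idea (the helper name and the set of supported strings are hypothetical):

import horovod.tensorflow as hvd

# Hypothetical mapping so users pass a plain string instead of importing horovod themselves.
_REDUCE_OPS = {"sum": hvd.Sum, "average": hvd.Average, "adasum": hvd.Adasum}

def resolve_reduce_op(name):
  try:
    return _REDUCE_OPS[name.lower()]
  except KeyError:
    raise ValueError("Unknown reduction '%s'; expected one of %s" % (name, sorted(_REDUCE_OPS)))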

  if grad is None:
    aggregated_grad.append(None)  # pass-through.
    continue
  elif isinstance(grad, ops.Tensor):
Member

MLP layers downstream of the embedding may also generate IndexedSlices gradients; that case needs to be handled.

Contributor Author

Sorry, I never noticed that MLP layers could generate an IndexedSlices gradient. Could you give me an example?

Member

@Lifann Feb 16, 2022

> Sorry, I never noticed that MLP layers could generate an IndexedSlices gradient. Could you give me an example?

For example:

var = tf.Variable(...)
params = de.get_variable(...)
emb = de.embedding_lookup(...)
latent_tensor = sum_pooling(emb)
... = some_func(latent_tensor, var)

The some_func is defined as:

def some_func(latent, var):
  mask = tf.greater_equal(var, threshold)
  pos = tf.where(mask)
  selected = tf.gather(var, pos)
  ...

The some_func code makes the gradient with respect to var become an IndexedSlices.
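
A self-contained sketch (not from this PR) that reproduces the effect: the gradient of a tf.gather on a dense variable comes back as tf.IndexedSlices, not a dense tensor.

import tensorflow as tf

var = tf.Variable(tf.random.normal([10, 4]))
with tf.GradientTape() as tape:
  selected = tf.gather(var, [1, 3, 5])  # sparse selection from a dense variable
  loss = tf.reduce_sum(selected)

grad = tape.gradient(loss, var)
print(type(grad))  # tf.IndexedSlices rather than a dense tf.Tensor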

Contributor Author

I see, thanks. I have now deleted the isinstance(grad, ops.Tensor) check, so Horovod will inspect the gradient's type and handle it.
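
For context, hvd.allreduce accepts both tf.Tensor and tf.IndexedSlices, so dropping the isinstance check leaves the type handling to Horovod; a minimal sketch:

import horovod.tensorflow as hvd

def aggregate(grad):
  if grad is None:
    return None  # pass-through
  # hvd.allreduce handles tf.IndexedSlices as well as dense tensors.
  return hvd.allreduce(grad, op=hvd.Sum)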

@a6802739 a6802739 force-pushed the horovod_sync_train branch 3 times, most recently from 8f415dd to 635f55a Compare February 15, 2022 07:56
Member

@Lifann left a comment

LGTM

@Lifann Lifann merged commit 6872e62 into tensorflow:master Feb 16, 2022