
[Feat]support horovod sync train #205

Merged

merged 1 commit into tensorflow:master from horovod_sync_train on Feb 16, 2022

Conversation

Contributor

@a6802739 commented Dec 30, 2021

  • Use Horovod to synchronize the dense gradients of the model parameters; for TrainableWrapper's gradients, keep the same behavior as before (see the sketch below).
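
A minimal sketch of that split, assuming horovod.tensorflow is imported as hvd and that TrainableWrapper is the dynamic-embedding trainable shell from tensorflow_recommenders_addons (the exact import path is an assumption):

import horovod.tensorflow as hvd
from tensorflow_recommenders_addons import dynamic_embedding as de

def aggregate_gradients(grads_and_vars):
  # Allreduce only the dense gradients; gradients that belong to a
  # TrainableWrapper are passed through unchanged, matching the previous
  # behavior for dynamic-embedding parameters.
  aggregated = []
  for grad, var in grads_and_vars:
    if grad is None or isinstance(var, de.TrainableWrapper):  # TrainableWrapper path is an assumption
      aggregated.append((grad, var))  # pass-through
    else:
      aggregated.append((hvd.allreduce(grad, op=hvd.Sum), var))
  return aggregated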

@rhdong rhdong changed the title support horovod sync train [feat]support horovod sync train Dec 30, 2021
@rhdong rhdong changed the title [feat]support horovod sync train [Feat]support horovod sync train Dec 30, 2021
Member

rhdong commented Dec 30, 2021

Hi @a6802739 , thank you for your contribution!

@a6802739 a6802739 force-pushed the horovod_sync_train branch 5 times, most recently from da36228 to 7475fe3 Compare December 30, 2021 13:43
    aggregated_grad.append(None)  # pass-through.
    continue
  elif isinstance(grad, ops.Tensor):
    aggregated_grad.append(hvd.allreduce(grad, op=hvd.Sum))
Member

And just for discussion: I noticed that Horovod implements lots of features and optimizations in its DistributedOptimizer, like tensor fusion, grouped allreduce, Adasum, gradient compression, etc., in recent versions. Using the hvd.allreduce API directly may not reuse these features.
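
For reference, those optimizations are normally picked up by wrapping the optimizer rather than calling hvd.allreduce per tensor; a minimal Keras-API sketch (the parameter choices are only illustrative):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
# DistributedOptimizer is where tensor fusion, grouped allreduce,
# gradient compression, and Adasum are wired in.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)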

Contributor Author

Yeah. Tensor fusion is controlled by an environment variable and is turned on by default, and for most recommendation workloads I think there is no need to care about grouped allreduce or gradient compression. But we could find a better way to expose the reduction operation to the user. If we let the user specify hvd.Sum or hvd.Adasum directly, they would have to import horovod before calling apply_gradients. Alternatively, we could let the user pass the reduction method as a string like "sum" and map it to the corresponding hvd op, but then if Horovod adds a new reduction op we would have to change the code to stay compatible with the newest Horovod version.
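
A sketch of the string-to-op mapping idea (the helper name and the set of supported strings are hypothetical):

import horovod.tensorflow as hvd

# Hypothetical mapping so users pass a plain string instead of importing horovod themselves.
_REDUCE_OPS = {"sum": hvd.Sum, "average": hvd.Average, "adasum": hvd.Adasum}

def resolve_reduce_op(name):
  try:
    return _REDUCE_OPS[name.lower()]
  except KeyError:
    raise ValueError("Unknown reduction '%s'; expected one of %s" % (name, sorted(_REDUCE_OPS)))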

  if grad is None:
    aggregated_grad.append(None)  # pass-through.
    continue
  elif isinstance(grad, ops.Tensor):
Member

MLP layers downstream of the embedding may also generate IndexedSlices gradients; that case needs to be handled.

Contributor Author

Sorry, I never noticed that MLP layers could generate an IndexedSlices gradient. Could you give me an example?

Member

@Lifann Feb 16, 2022

> Sorry, I never noticed that MLP layers could generate an IndexedSlices gradient. Could you give me an example?

For example:

var = tf.Variable(...)
params = de.get_variable(...)
emb = de.embedding_lookup(...)
latent_tensor = sum_pooling(emb)
... = some_func(latent_tensor, var)

The some_func is defined as:

def some_func(latent, var):
  mask = tf.greater_equal(var, threshold)
  pos = tf.where(mask)
  selected = tf.gather(var, pos)
  ...

The some_func code makes the gradient with respect to var become an IndexedSlices.
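
A self-contained sketch (not from this PR) that reproduces the effect: the gradient of a tf.gather on a dense variable comes back as tf.IndexedSlices, not a dense tensor.

import tensorflow as tf

var = tf.Variable(tf.random.normal([10, 4]))
with tf.GradientTape() as tape:
  selected = tf.gather(var, [1, 3, 5])  # sparse selection from a dense variable
  loss = tf.reduce_sum(selected)

grad = tape.gradient(loss, var)
print(type(grad))  # tf.IndexedSlices rather than a dense tf.Tensor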

Contributor Author

I see, thanks. I have now deleted the isinstance(grad, ops.Tensor) check, so Horovod will inspect the gradient's type and handle it.
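
For context, hvd.allreduce accepts both tf.Tensor and tf.IndexedSlices, so dropping the isinstance check leaves the type handling to Horovod; a minimal sketch:

import horovod.tensorflow as hvd

def aggregate(grad):
  if grad is None:
    return None  # pass-through
  # hvd.allreduce handles tf.IndexedSlices as well as dense tensors.
  return hvd.allreduce(grad, op=hvd.Sum)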

@a6802739 a6802739 force-pushed the horovod_sync_train branch 3 times, most recently from 8f415dd to 635f55a Compare February 15, 2022 07:56
Member

@Lifann left a comment

LGTM

@Lifann Lifann merged commit 6872e62 into tensorflow:master Feb 16, 2022