
[Features] Enable Variable Partitioning in ParameterServerStrategy graph mode #23254

Merged 3 commits into tensorflow:master on Nov 9, 2018

Conversation

wangsiyu
Contributor

@wangsiyu wangsiyu commented Oct 25, 2018

Hi @yuefengz,

Variable partitioning is very important in the parameter server architecture for load balancing. It is widely used in recommendation systems for distributing large embedding variables.

In the DistributionStrategy architecture, the variable partitioner is ignored in all cases. I understand it would be complicated to enable the variable partitioner everywhere, for example in eager mode; it may even involve the PartitionVariableScope in TF 2.0, which will affect tf.Variable creation with tf.variable_creator_scope. However, it is straightforward and appropriate to support partitioning for ParameterServerStrategy in graph mode. Every subclass of DistributionStrategy can override the _allow_variable_partition method to decide whether to enable it. Currently, only ParameterServerStrategy overrides it.

I would appreciate a discussion if there are other solutions for supporting the variable partitioner.

Thanks.
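
For concreteness, here is a minimal sketch of the usage this would enable in graph mode (assuming the contrib ParameterServerStrategy and the TF 1.x variable scope APIs; the shapes and shard count below are only illustrative):

import tensorflow as tf

# Assumes ParameterServerStrategy is available under tf.contrib.distribute
# in this TF version.
strategy = tf.contrib.distribute.ParameterServerStrategy()

with tf.Graph().as_default(), strategy.scope():
  # With partitioning honored in graph mode, the embedding table is split
  # into shards that can be placed on different parameter servers.
  embeddings = tf.get_variable(
      "embeddings",
      shape=[1000000, 64],
      initializer=tf.truncated_normal_initializer(stddev=0.02),
      partitioner=tf.fixed_size_partitioner(num_shards=4))
  # Expected to be a PartitionedVariable rather than a single variable.
  print(type(embeddings))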

@ymodak ymodak self-assigned this Oct 25, 2018
@ymodak ymodak requested a review from yuefengz October 25, 2018 16:35
@ymodak ymodak added the awaiting review (Pull request awaiting review) label Oct 25, 2018
@yuefengz
Contributor

@wangsiyu Hi, thank you so much for sending out this PR! Right now I am not able to import your PR. Is it because your repository is not up to date?

Contributor

@yuefengz yuefengz left a comment

The PR mostly looks good to me. Have you tested it with two GPUs?

a = constant_op.constant([1.0, 2.0])
b = constant_op.constant([2.0, 3.0])
c = a + b
self.assertEqual(a.device, worker_device + '/' + last_part_device)
Contributor

Is this part related to partitioned variables? Would you mind simplifying the test a little so that 1) there is no redundant testing logic and 2) the result values of y, z and f are more obvious?

@wangsiyu
Contributor Author

Hi @yuefengz, thanks for your comments. I will check the merge compatibility and simplify the test case.

@tensorflowbutler tensorflowbutler removed the awaiting review (Pull request awaiting review) label Oct 27, 2018
@yuefengz
Contributor

yuefengz commented Oct 28, 2018

Could you also try running unit tests with num_gpus=2? You don't have to have 2 GPUs to run that. Just to make sure AggregatingVariable works with PartitionedVariable. Thanks!

@wangsiyu
Contributor Author

@yuefengz I have simplified the unit test and added a test case for num_gpus > 1. It works with AggregatingVariable with VariableAggregation.SUM. I have also merged in the latest version of the code. Are you able to import it now?
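
Roughly, the combination being exercised looks like the following sketch (argument names such as num_gpus_per_worker are assumptions about the contrib API of that time, and no physical GPUs are needed just to build the graph):

import tensorflow as tf

# Assumption: num_gpus_per_worker is the relevant constructor argument here.
strategy = tf.contrib.distribute.ParameterServerStrategy(num_gpus_per_worker=2)

with tf.Graph().as_default(), strategy.scope():
  # A variable that is both partitioned and, because num_gpus > 1 and an
  # aggregation is requested, wrapped for aggregation across towers.
  w = tf.get_variable(
      "w",
      shape=[20, 10],
      initializer=tf.zeros_initializer(),
      partitioner=tf.fixed_size_partitioner(num_shards=2),
      aggregation=tf.VariableAggregation.SUM)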

Contributor

@yuefengz yuefengz left a comment

A few nits. Thank you for the change!

@@ -231,6 +231,9 @@ def _broadcast(self, tensor, destinations):
destinations = self._compute_devices
return self._cross_tower_ops.broadcast(tensor, destinations)

def _allow_variable_partition(self):
return True if not context.executing_eagerly() else False
Contributor

You can just do return not context.executing_eagerly().
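
That is, presumably something like the following in the strategy subclass (a sketch of the suggested simplification only):

# Assuming `context` is tensorflow.python.eager.context, which the
# surrounding module already imports:
def _allow_variable_partition(self):
  # Partitioned variables are only supported in graph mode.
  return not context.executing_eagerly()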

config=sess_config) as sess, \
d.scope():

# Define a variable outside the call_for_each_tower scope. This is not
Contributor

It is fine to create the variable as long as it is under the distribution strategy's scope. Could you remove this comment?

constraint=None):
constraint=None,
synchronization=VariableSynchronization.AUTO,
aggregation=VariableAggregation.NONE):
Contributor

Could you update the documentation for the two new arguments?
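
Something along these lines would probably work, mirroring the wording get_variable already uses for these arguments (a suggestion, not the final text):

synchronization: Indicates when a distributed variable will be aggregated.
  Accepted values are constants defined in the class tf.VariableSynchronization.
  By default the synchronization is set to AUTO and the current
  DistributionStrategy chooses when to synchronize.
aggregation: Indicates how a distributed variable will be aggregated.
  Accepted values are constants defined in the class tf.VariableAggregation.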

@@ -1661,7 +1672,9 @@ def _get_partitioned_variable(name,
partitioner=None,
validate_shape=True,
use_resource=None,
constraint=None):
constraint=None,
synchronization=VariableSynchronization.AUTO,
Contributor

Same here.

@wangsiyu
Contributor Author

@yuefengz The code has been refined. Please check again.

@yuefengz
Contributor

@wangsiyu Thank you for your PR! Please let me know whether ParameterServerStrategy works in your case and what else you need. Feel free to send me emails.

@wangsiyu
Contributor Author

@yuefengz Yes. ParameterServerStrategy is currently used to train recommendation-system models in one business unit of Alibaba Group, and I am also trying to push MirroredStrategy to other users. I have run into some performance optimization problems and would like to discuss with you how to address them in the DistributionStrategy framework. I will send PRs or e-mails when needed. Thanks very much.

@ymodak ymodak added the kokoro:force-run (Tests on submitted change) and ready to pull (PR ready for merge process) labels Nov 2, 2018
@kokoro-team kokoro-team removed the kokoro:force-run (Tests on submitted change) label Nov 2, 2018
@yuefengz
Contributor

yuefengz commented Nov 8, 2018

@ymodak Could you please help merge this PR? Thank you!

@tensorflow-copybara tensorflow-copybara merged commit bdd6057 into tensorflow:master Nov 9, 2018
tensorflow-copybara pushed a commit that referenced this pull request Nov 9, 2018
Labels: cla: yes, ready to pull (PR ready for merge process)

7 participants