
[Features] Enable Variable Partitioning in ParameterServerStrategy graph mode #23254

Merged 3 commits into tensorflow:master on Nov 9, 2018

Conversation

wangsiyu
Contributor

@wangsiyu wangsiyu commented Oct 25, 2018

Hi @yuefengz,

Variable partitioning is very important in the parameter server architecture for load balancing. It is widely used in recommendation systems for distributing large embedding variables.

In the DistributionStrategy architecture, the variable partitioner is ignored in all cases. I understand it would be complicated to enable the variable partitioner everywhere, for example in eager mode; it may even involve the PartitionVariableScope in TF 2.0, which will affect tf.Variable creation with tf.variable_creator_scope. However, it is straightforward and appropriate to support partitioning for ParameterServerStrategy in graph mode. Every subclass of DistributionStrategy can override the _allow_variable_partition method to decide whether to enable it. Currently, only ParameterServerStrategy overrides it.

I would appreciate a discussion if there are other solutions for supporting the variable partitioner.

Thanks.
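
For concreteness, here is a minimal sketch of the usage this would enable in graph mode (assuming the contrib ParameterServerStrategy and the TF 1.x variable scope APIs; the shapes and shard count below are only illustrative):

import tensorflow as tf

# Assumes ParameterServerStrategy is available under tf.contrib.distribute
# in this TF version.
strategy = tf.contrib.distribute.ParameterServerStrategy()

with tf.Graph().as_default(), strategy.scope():
  # With partitioning honored in graph mode, the embedding table is split
  # into shards that can be placed on different parameter servers.
  embeddings = tf.get_variable(
      "embeddings",
      shape=[1000000, 64],
      initializer=tf.truncated_normal_initializer(stddev=0.02),
      partitioner=tf.fixed_size_partitioner(num_shards=4))
  # Expected to be a PartitionedVariable rather than a single variable.
  print(type(embeddings))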

@ymodak ymodak self-assigned this Oct 25, 2018
@ymodak ymodak requested a review from yuefengz October 25, 2018 16:35
@ymodak ymodak added the awaiting review (Pull request awaiting review) label Oct 25, 2018
@yuefengz
Contributor

@wangsiyu Hi, thank you so much for sending out this PR! Right now I am not able to import your PR. Is it because your repository is not up to date?

Contributor

@yuefengz yuefengz left a comment

The PR mostly looks good to me. Have you tested it with two GPUs?

a = constant_op.constant([1.0, 2.0])
b = constant_op.constant([2.0, 3.0])
c = a + b
self.assertEqual(a.device, worker_device + '/' + last_part_device)
Contributor

Is this part related to partitioned variables? Would you mind simplifying the test a little so that 1) there is no redundant testing logic and 2) the result values of y, z and f are more obvious?

@wangsiyu
Contributor Author

Hi @yuefengz, thanks for your comments. I will check the merge compatibility and simplify the test case.

@tensorflowbutler tensorflowbutler removed the awaiting review (Pull request awaiting review) label Oct 27, 2018
@yuefengz
Contributor

yuefengz commented Oct 28, 2018

Could you also try running unit tests with num_gpus=2? You don't have to have 2 GPUs to run that. Just to make sure AggregatingVariable works with PartitionedVariable. Thanks!

@wangsiyu
Contributor Author

@yuefengz I have simplified the unit test and added a test case for num_gpus > 1. It works with AggregatingVariable with VariableAggregation.SUM. I have also merged in the latest version of the code. Are you able to import it now?
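
Roughly, the combination being exercised looks like the following sketch (argument names such as num_gpus_per_worker are assumptions about the contrib API of that time, and no physical GPUs are needed just to build the graph):

import tensorflow as tf

# Assumption: num_gpus_per_worker is the relevant constructor argument here.
strategy = tf.contrib.distribute.ParameterServerStrategy(num_gpus_per_worker=2)

with tf.Graph().as_default(), strategy.scope():
  # A variable that is both partitioned and, because num_gpus > 1 and an
  # aggregation is requested, wrapped for aggregation across towers.
  w = tf.get_variable(
      "w",
      shape=[20, 10],
      initializer=tf.zeros_initializer(),
      partitioner=tf.fixed_size_partitioner(num_shards=2),
      aggregation=tf.VariableAggregation.SUM)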

Contributor

@yuefengz yuefengz left a comment

A few nits. Thank you for the change!

@@ -231,6 +231,9 @@ def _broadcast(self, tensor, destinations):
destinations = self._compute_devices
return self._cross_tower_ops.broadcast(tensor, destinations)

def _allow_variable_partition(self):
return True if not context.executing_eagerly() else False
Contributor

You can just do return not context.executing_eagerly().
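
That is, presumably something like the following in the strategy subclass (a sketch of the suggested simplification only):

# Assuming `context` is tensorflow.python.eager.context, which the
# surrounding module already imports:
def _allow_variable_partition(self):
  # Partitioned variables are only supported in graph mode.
  return not context.executing_eagerly()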

config=sess_config) as sess, \
d.scope():

# Define a variable outside the call_for_each_tower scope. This is not
Contributor

It is fine to create the variable as long as it is under the distribution strategy's scope. Could you remove this comment?

constraint=None):
constraint=None,
synchronization=VariableSynchronization.AUTO,
aggregation=VariableAggregation.NONE):
Contributor

Could you update the documentation for the two new arguments?
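
Something along these lines would probably work, mirroring the wording get_variable already uses for these arguments (a suggestion, not the final text):

synchronization: Indicates when a distributed variable will be aggregated.
  Accepted values are constants defined in the class tf.VariableSynchronization.
  By default the synchronization is set to AUTO and the current
  DistributionStrategy chooses when to synchronize.
aggregation: Indicates how a distributed variable will be aggregated.
  Accepted values are constants defined in the class tf.VariableAggregation.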

@@ -1661,7 +1672,9 @@ def _get_partitioned_variable(name,
partitioner=None,
validate_shape=True,
use_resource=None,
constraint=None):
constraint=None,
synchronization=VariableSynchronization.AUTO,
Contributor

Same here.

@wangsiyu
Contributor Author

@yuefengz The code has been refined. Please check again.

@yuefengz
Contributor

@wangsiyu Thank you for your PR! Please let me know whether ParameterServerStrategy works in your case and what else you need. Feel free to send me emails.

@wangsiyu
Contributor Author

@yuefengz Yes. ParameterServerStrategy is currently used to train recommendation-system models in one business unit of Alibaba Group, and I am also trying to push MirroredStrategy to other users. I have run into some performance optimization problems and would like to discuss with you how to address them in the DistributionStrategy framework. I will send PRs or e-mails when needed. Thanks very much.

@ymodak ymodak added the kokoro:force-run (Tests on submitted change) and ready to pull (PR ready for merge process) labels Nov 2, 2018
@kokoro-team kokoro-team removed the kokoro:force-run (Tests on submitted change) label Nov 2, 2018
@yuefengz
Contributor

yuefengz commented Nov 8, 2018

@ymodak Could you please help merge this PR? Thank you!

@tensorflow-copybara tensorflow-copybara merged commit bdd6057 into tensorflow:master Nov 9, 2018
tensorflow-copybara pushed a commit that referenced this pull request Nov 9, 2018
Labels: cla: yes, ready to pull (PR ready for merge process)

7 participants