[Features] Enable Variable Partitioning in ParameterServerStrategy graph mode #23254
Conversation
@wangsiyu Hi, thank you so much for sending out this PR! Right now I am not able to import your PR. Is it because your repository is not up to date?
The PR mostly looks good to me. Have you tested it with two GPUs?
a = constant_op.constant([1.0, 2.0])
b = constant_op.constant([2.0, 3.0])
c = a + b
self.assertEqual(a.device, worker_device + '/' + last_part_device)
Is this part related to the partitioned variable? Would you mind simplifying the test a little bit so that 1) there is no redundant testing logic and 2) the result values of `y`, `z` and `f` are more obvious?
Hi @yuefengz, thanks for your comments. I will check the merge compatibility and simplify the test case.
Could you also try running unit tests with num_gpus=2? You don't have to have 2 GPUs to run that. Just to make sure
Force-pushed from 1f23c15 to 2cfab14
@yuefengz I have simplified the unit test and added a test case for when num_gpus > 1. It works with
A few nits. Thank you for the change!
@@ -231,6 +231,9 @@ def _broadcast(self, tensor, destinations):
      destinations = self._compute_devices
    return self._cross_tower_ops.broadcast(tensor, destinations)

  def _allow_variable_partition(self):
    return True if not context.executing_eagerly() else False
You can just do `return not context.executing_eagerly()`.
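The hook-method pattern this PR introduces can be illustrated with a minimal, TensorFlow-free sketch. The class names mirror the PR's, but everything here is a simplified stand-in (in particular, `executing_eagerly` below is a hypothetical flag standing in for TensorFlow's `context.executing_eagerly()`):

```python
# Stand-in for tensorflow's context.executing_eagerly(); in this sketch
# we model graph mode, so it always returns False.
def executing_eagerly():
    return False


class DistributionStrategy:
    """Base strategy: variable partitioning is disabled by default."""

    def _allow_variable_partition(self):
        return False


class ParameterServerStrategy(DistributionStrategy):
    """Only this subclass opts in, and only in graph mode."""

    def _allow_variable_partition(self):
        # Reviewer's simplification of
        # `return True if not context.executing_eagerly() else False`.
        return not executing_eagerly()


print(ParameterServerStrategy()._allow_variable_partition())  # True in graph mode
print(DistributionStrategy()._allow_variable_partition())     # False
```

Each strategy subclass can thus opt in independently without the variable-creation code needing to know which strategy is active.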
config=sess_config) as sess, \
    d.scope():

# Define a variable outside the call_for_each_tower scope. This is not
It is fine to create the variable as long as it is under the distribution strategy's scope. Could you remove this comment?
constraint=None):
constraint=None,
synchronization=VariableSynchronization.AUTO,
aggregation=VariableAggregation.NONE):
Could you update the documentation for the two new arguments?
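A documentation update for the two new arguments could look like the following sketch. The enum member names match TensorFlow's public `tf.VariableSynchronization` / `tf.VariableAggregation`, but the stub function and docstring wording are my illustration, not the text that landed in the PR:

```python
from enum import Enum


class VariableSynchronization(Enum):
    # Mirrors tf.VariableSynchronization.
    AUTO = 0
    NONE = 1
    ON_WRITE = 2
    ON_READ = 3


class VariableAggregation(Enum):
    # Mirrors tf.VariableAggregation (ONLY_FIRST_REPLICA omitted for brevity).
    NONE = 0
    SUM = 1
    MEAN = 2


def get_variable(name,
                 synchronization=VariableSynchronization.AUTO,
                 aggregation=VariableAggregation.NONE):
    """Stub illustrating docstrings for the two new arguments.

    Args:
      name: Name of the variable.
      synchronization: Indicates when a distributed variable will be
        synchronized. Defaults to `AUTO`, which lets the current
        `DistributionStrategy` choose.
      aggregation: Indicates how a distributed variable will be aggregated
        across replicas. Defaults to `NONE` (no aggregation).
    """
    return (name, synchronization, aggregation)
```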
@@ -1661,7 +1672,9 @@ def _get_partitioned_variable(name,
partitioner=None,
validate_shape=True,
use_resource=None,
constraint=None):
constraint=None,
synchronization=VariableSynchronization.AUTO,
Same here.
@yuefengz The code has been refined. Please check again.
@wangsiyu Thank you for your PR! Please let me know whether
@yuefengz Yes. Currently
@ymodak Could you please help merge this PR? Thank you!
PiperOrigin-RevId: 220729932
call_for_each_replica, and call_for_each_tower is about to be replaced by call_for_each_replica. PiperOrigin-RevId: 220820779
Hi @yuefengz,

Variable partitioning is very important in the parameter server architecture for load balancing. It has been widely used in recommendation systems for distributing large embedding variables.

In the `DistributionStrategy` architecture, the variable partitioner is currently ignored in all cases. I understand it would be complicated to enable the variable partitioner everywhere, such as in `Eager` mode. It may even involve the `PartitionVariableScope` in TF 2.0, which will influence `tf.Variable` declaration with `tf.variable_creator_scope`. However, it is easy and suitable to support partitioning for `ParameterServerStrategy` in graph mode. Every subclass of `DistributionStrategy` can override the `_allow_variable_partition` method to decide whether to enable it; currently, only `ParameterServerStrategy` overrides it.

It would be appreciated to have a discussion if there are other solutions for supporting the variable partitioner.

Thanks.
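To make the load-balancing motivation concrete, here is a small stdlib-only sketch of what fixed-size partitioning does: the first axis of a large embedding variable is split into near-equal shards, which can then be placed round-robin across parameter server tasks. The splitting rule (base size, with one extra row for the first `remainder` shards) and the helper names are my illustration of the idea, not the exact TensorFlow implementation:

```python
def partition_sizes(dim, num_shards):
    """Split `dim` rows into `num_shards` near-equal shard sizes."""
    base, rem = divmod(dim, num_shards)
    # The first `rem` shards each take one extra row.
    return [base + (1 if i < rem else 0) for i in range(num_shards)]


def place_shards(sizes, ps_tasks):
    """Round-robin each shard onto a parameter server task."""
    return [(size, "/job:ps/task:%d" % (i % ps_tasks))
            for i, size in enumerate(sizes)]


# A [10, 128] embedding variable split into 3 shards over 2 PS tasks:
sizes = partition_sizes(10, 3)
print(sizes)                    # [4, 3, 3]
print(place_shards(sizes, 2))   # shard 0 -> ps:0, shard 1 -> ps:1, shard 2 -> ps:0
```

Spreading the shards this way is what balances the memory and lookup traffic of a huge embedding table across the parameter servers, instead of pinning the whole variable to one task.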