Patch to core for RFC: Sparse Domain Isolation for Supporting Large-scale Sparse Weights #41371
Conversation
@yuefengz @tanzhenyu @byronyi @alextp Hi, this is the code patch for the RFC Sparse Domain Isolation for Supporting Large-scale Sparse Weights Training, and I hope you have time to help review. Thank you!
follow
follow
AFAICT it's possible to have the dynamic embedding ops entirely in a third-party repo, and we just need the trainable interface part of this PR to be in core TF. Is that true?
That is true -- and I have the same comments. Let's keep what needs to be changed in the lookup table, and leave the rest in the new SIG/repo.
@alextp @tanzhenyu Yes, we need the trainable interface part of this PR to be in core, especially the part of
        cached_value=cached_value)

  def update_op(self):
    return self.params.upsert(self.ids, self.read_value(False))
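For readers outside the review thread, a rough sketch of the pattern this update_op implements; the TrainableWrapperSketch name, the params object, and its upsert method are assumptions for illustration, not the exact API added by the patch:

# Hypothetical, simplified stand-in for the trainable wrapper in this patch.
class TrainableWrapperSketch:
  def __init__(self, params, ids, initial_values):
    self.params = params           # dynamic-embedding table exposing upsert(keys, values)
    self.ids = ids                 # sparse ids gathered for the current step
    self._cached = initial_values  # dense values looked up from the table

  def read_value(self, do_prefetch):
    # The real wrapper can re-prefetch from the table; here we simply
    # return the locally cached dense values.
    return self._cached

  def update_op(self):
    # After the optimizer has updated the cached dense values,
    # write them back to the backing hash table.
    return self.params.upsert(self.ids, self.read_value(False))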
If multiple workers update the same embeddings, will only one update take effect? Or, if one worker reads the embeddings for training and returns gradients after a long delay, could the embeddings trained in the meantime be overwritten by this stale update?
Good question. That's a common problem in asynchronous training. To fix that, we read again (from the hash tables) when applying gradients to variables and slots: here
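A minimal sketch of that re-read-before-apply idea, assuming hypothetical lookup/upsert methods on the table and a plain SGD update (not the actual optimizer code in this PR):

import numpy as np

class DictTable:
  """Toy stand-in for the dynamic-embedding hash table (dict-backed)."""
  def __init__(self, dim):
    self.dim = dim
    self.data = {}

  def lookup(self, ids):
    return np.stack([self.data.setdefault(i, np.zeros(self.dim)) for i in ids])

  def upsert(self, ids, values):
    for i, v in zip(ids, values):
      self.data[i] = v

def apply_sparse_gradients(table, ids, grads, lr=0.01):
  # Re-read the latest embeddings right before applying the gradient, so a
  # slow worker does not overwrite fresher values with a stale snapshot.
  latest = table.lookup(ids)
  table.upsert(ids, latest - lr * grads)

table = DictTable(dim=4)
apply_sparse_gradients(table, ids=[3, 7], grads=np.ones((2, 4)))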
And another question: does this support synchronous training?
Yes, the RFC is compatible with all distribution strategies of TensorFlow, not only the PS-Worker mode.
How will the local variables on each replica sync with each other and push the gradients to the lookup hash table in synchronous training?
Plus, since one more round of I/O has been introduced by the local variables, will it cause training-speed degradation?
AFAICT, the strategy you mentioned is hybrid parallelism; this paper may be helpful: https://arxiv.org/abs/1909.04823
Is there any corresponding patch in TF-Serving? How do you deal with the problem of the huge memory used by the hash table on a single machine?
@lilao might have better ideas on this for TF Serving.
This feature will be helpful for our recommendation system; we hope to use it as soon as possible. Follow.
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment containing only the required consent phrase.
Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s) and set the appropriate label.
ℹ️ Googlers: Go here for more info.
There are several conflicts. Can you resolve them, please?
@rhdong Can you please resolve conflicts? Thanks!
""" | ||
|
||
partition_index = self.partition_fn(keys, self.shard_num) | ||
keys_partitions, _ = _partition(keys, partition_index, self.shard_num) |
Suppose we have keys in the range [0, 10000] and shard_num = 8; then partition_0 will get keys [0, 8, 16, 24, ..., 10000], and MutableHashTableOfTensors will store these keys in a std::unordered_map container.
However, unordered_map uses collision chaining to resolve hash collisions, with slot_index = hash(key) % bucket_count. Suppose bucket_count is huge, like 2^20; then the keys' slot_index values will be [0, 8, 16, 24, ..., 10000], which means we only use 1/shard_num of the bucket_count slots in the unordered_map. This may lead to heavy hash collisions, and finally to poor performance.
Suggestions (see the sketch below):
- provide an int/int64 hash function like MurmurHash
- map keys one-to-one to another form, e.g. new_keys = keys / shard_num
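A small NumPy sketch of the second suggestion, just to illustrate the effect; the bucket count and key range here are made-up numbers, not values taken from the PR:

import numpy as np

shard_num = 8
bucket_count = 1 << 14                       # made-up bucket count
keys = np.arange(0, 1_000_000, shard_num)    # strided keys routed to partition_0

def avg_chain_len(ks):
  # Identity-style hashing: slot = key % bucket_count.
  slots = ks % bucket_count
  counts = np.bincount(slots, minlength=bucket_count)
  return counts[counts > 0].mean()

# Strided keys occupy only 1/shard_num of the buckets, so chains are roughly
# shard_num times longer than with the remapped (dense) keys.
print(avg_chain_len(keys))                # ~61 keys per occupied bucket
print(avg_chain_len(keys // shard_num))   # ~8 keys per occupied bucket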
Hi @rangjiaheng, thanks for your comments; that's really a problem, and I will consider your suggestion. But before that, maybe you can customize a partitioner for Variable to avoid this problem, for example along the lines of the sketch below.
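For illustration, a sketch of such a custom partitioner, assuming the Variable accepts a partitioner with the (keys, shard_num) signature used by self.partition_fn in the diff above; the bit-mixing here is a cheap stand-in for a proper int64 hash such as MurmurHash:

import tensorflow as tf

def mixed_hash_partition(keys, shard_num):
  """Route keys to shards by a bit-mixed value rather than raw keys % shard_num,
  so the keys landing on one shard are not an arithmetic sequence with a fixed
  stride (which would cluster into few buckets inside the backing hash table)."""
  keys = tf.cast(keys, tf.int64)
  # Knuth-style multiplicative mixing plus an xor-shift; a production setup
  # would likely use a stronger int64 hash such as MurmurHash instead.
  mixed = keys * tf.constant(2654435761, dtype=tf.int64)
  mixed = tf.bitwise.bitwise_xor(mixed, tf.bitwise.right_shift(mixed, 16))
  # floormod keeps the result non-negative even if the multiply wrapped around.
  return tf.cast(tf.math.floormod(mixed, tf.cast(shard_num, tf.int64)), tf.int32)

Assuming the Variable exposes a way to set partition_fn, a function with this signature could be dropped in as the partitioner.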
@rangjiaheng Any update on this PR? Please. Thanks!
@rhdong Can you please resolve conflicts? Thanks!
@rhdong Any update on this PR? Please. Thanks!
It has been 15 days with no activity and the
I'm going to go ahead and close this PR, because it seems to have stalled. If you're still interested in pursuing this (and responding to my comments), please feel free to reopen!
Looks like the author will continue the work in the recommenders-addons project.
This is a patch to core for the RFC: Sparse Domain Isolation for Supporting Large-scale Sparse Weights.
Please visit tensorflow/community#237