Patch to core for RFC: Sparse Domain Isolation for Supporting Large-scale Sparse Weights #41371
Conversation
@yuefengz @tanzhenyu @byronyi @alextp Hi, this is the code patch for the RFC Sparse Domain Isolation for Supporting Large-scale Sparse Weights Training, and I hope you have time to help review. Thank you!
follow
follow
AFAICT it's possible to have the dynamic embedding ops entirely in a third-party repo, and we just need the trainable interface part of this PR to be in core TF. Is that true?
That is true -- and I have the same comments. Let's keep what needs to be changed in the lookup table, and leave the rest in the new SIG/repo.
@alextp @tanzhenyu Yes, we need the trainable interface part of this PR to be in core, especially the part of
        cached_value=cached_value)

  def update_op(self):
    return self.params.upsert(self.ids, self.read_value(False))
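For readers outside the review thread, a rough sketch of the pattern this update_op implements; the TrainableWrapperSketch name, the params object, and its upsert method are assumptions for illustration, not the exact API added by the patch:

# Hypothetical, simplified stand-in for the trainable wrapper in this patch.
class TrainableWrapperSketch:
  def __init__(self, params, ids, initial_values):
    self.params = params           # dynamic-embedding table exposing upsert(keys, values)
    self.ids = ids                 # sparse ids gathered for the current step
    self._cached = initial_values  # dense values looked up from the table

  def read_value(self, do_prefetch):
    # The real wrapper can re-prefetch from the table; here we simply
    # return the locally cached dense values.
    return self._cached

  def update_op(self):
    # After the optimizer has updated the cached dense values,
    # write them back to the backing hash table.
    return self.params.upsert(self.ids, self.read_value(False))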
If multiple workers update the same embeddings, will only one update take effect? Or, if one worker reads the embeddings for training and returns gradients after a long delay, could the embeddings trained in the meantime be overwritten by this stale update?
Good question. That's a common problem in asynchronous training. To fix that, we read again (from the hash tables) when applying gradients to variables and slots: here
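A minimal sketch of that re-read-before-apply idea, assuming hypothetical lookup/upsert methods on the table and a plain SGD update (not the actual optimizer code in this PR):

import numpy as np

class DictTable:
  """Toy stand-in for the dynamic-embedding hash table (dict-backed)."""
  def __init__(self, dim):
    self.dim = dim
    self.data = {}

  def lookup(self, ids):
    return np.stack([self.data.setdefault(i, np.zeros(self.dim)) for i in ids])

  def upsert(self, ids, values):
    for i, v in zip(ids, values):
      self.data[i] = v

def apply_sparse_gradients(table, ids, grads, lr=0.01):
  # Re-read the latest embeddings right before applying the gradient, so a
  # slow worker does not overwrite fresher values with a stale snapshot.
  latest = table.lookup(ids)
  table.upsert(ids, latest - lr * grads)

table = DictTable(dim=4)
apply_sparse_gradients(table, ids=[3, 7], grads=np.ones((2, 4)))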
And another question: does this support synchronous training?
Yes, the RFC is compatible with all distribution strategies of TensorFlow, not only the PS-Worker mode.
How will the local variables on each replica sync with each other and push the gradients to the lookup hash table in synchronous training?
Plus, since one more round of I/O has been introduced by the local variables, will it cause training-speed degradation?
AFAICT, the strategy you mentioned is hybrid parallelism; this paper may be helpful: https://arxiv.org/abs/1909.04823
Is there any corresponding patch in TF-Serving? How do you deal with the problem of the huge memory used by the hash table on a single machine?
@lilao might have better ideas on this for TF Serving.
This feature will be helpful for our recommendation system; we hope to use it as soon as possible. Follow.
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment containing only the required consent phrase.
Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s) and set the appropriate label.
ℹ️ Googlers: Go here for more info.
There are several conflicts. Can you resolve them, please?
@rhdong Can you please resolve conflicts? Thanks!
""" | ||
|
||
partition_index = self.partition_fn(keys, self.shard_num) | ||
keys_partitions, _ = _partition(keys, partition_index, self.shard_num) |
Suppose we have keys in the range [0, 10000] and shard_num = 8; then partition_0 will get keys [0, 8, 16, 24, ..., 10000], and MutableHashTableOfTensors will store these keys in a std::unordered_map container.
However, unordered_map uses collision chaining to resolve hash collisions, with slot_index = hash(key) % bucket_count. Suppose bucket_count is huge, like 2^20; then the keys' slot_index values will be [0, 8, 16, 24, ..., 10000], which means we only use 1/shard_num of the bucket_count slots in the unordered_map. This may lead to heavy hash collisions, and finally to poor performance.
Suggestions (see the sketch below):
- provide an int/int64 hash function like MurmurHash
- map keys one-to-one to another form, e.g. new_keys = keys / shard_num
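A small NumPy sketch of the second suggestion, just to illustrate the effect; the bucket count and key range here are made-up numbers, not values taken from the PR:

import numpy as np

shard_num = 8
bucket_count = 1 << 14                       # made-up bucket count
keys = np.arange(0, 1_000_000, shard_num)    # strided keys routed to partition_0

def avg_chain_len(ks):
  # Identity-style hashing: slot = key % bucket_count.
  slots = ks % bucket_count
  counts = np.bincount(slots, minlength=bucket_count)
  return counts[counts > 0].mean()

# Strided keys occupy only 1/shard_num of the buckets, so chains are roughly
# shard_num times longer than with the remapped (dense) keys.
print(avg_chain_len(keys))                # ~61 keys per occupied bucket
print(avg_chain_len(keys // shard_num))   # ~8 keys per occupied bucket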
Hi @rangjiaheng, thanks for your comments; that's really a problem, and I will consider your suggestion. But before that, maybe you can customize a partitioner for Variable to avoid this problem, for example along the lines of the sketch below.
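For illustration, a sketch of such a custom partitioner, assuming the Variable accepts a partitioner with the (keys, shard_num) signature used by self.partition_fn in the diff above; the bit-mixing here is a cheap stand-in for a proper int64 hash such as MurmurHash:

import tensorflow as tf

def mixed_hash_partition(keys, shard_num):
  """Route keys to shards by a bit-mixed value rather than raw keys % shard_num,
  so the keys landing on one shard are not an arithmetic sequence with a fixed
  stride (which would cluster into few buckets inside the backing hash table)."""
  keys = tf.cast(keys, tf.int64)
  # Knuth-style multiplicative mixing plus an xor-shift; a production setup
  # would likely use a stronger int64 hash such as MurmurHash instead.
  mixed = keys * tf.constant(2654435761, dtype=tf.int64)
  mixed = tf.bitwise.bitwise_xor(mixed, tf.bitwise.right_shift(mixed, 16))
  # floormod keeps the result non-negative even if the multiply wrapped around.
  return tf.cast(tf.math.floormod(mixed, tf.cast(shard_num, tf.int64)), tf.int32)

Assuming the Variable exposes a way to set partition_fn, a function with this signature could be dropped in as the partitioner.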
@rangjiaheng Any update on this PR? Please. Thanks!
@rhdong Can you please resolve conflicts? Thanks!
@rhdong Any update on this PR? Please. Thanks!
It has been 15 days with no activity and the
I'm going to go ahead and close this PR, because it seems to have stalled. If you're still interested in pursuing this (and responding to my comments), please feel free to reopen!
Looks like the author will continue the work in the recommenders-addons project.
This is a patch to core for the RFC: Sparse Domain Isolation for Supporting Large-scale Sparse Weights.
Please visit tensorflow/community#237