Performance degradation with large lookup tables - optimizer._apply_sparse_duplicate_indices (TF V1.0.1) #10270
Comments
Forgot to mention - this is the GPU info:
I am using this Docker base image:
Interesting timelines, and sorry you're running into this. Some/most of this time is likely copying to host memory: we don't actually have a GPU kernel for unique, but one needed to be registered to avoid interfering with op placements. The computation happens on the CPU. So this could be sped up by implementing a real GPU kernel for Unique. That's likely preferable to fusing Adam's sparse updates into a GPU kernel, although that is another possibility. So, any interest in writing a GPU kernel for unique? I don't know of anyone who is working on one right now.
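A small sketch (TF 1.x API, assuming a GPU is available) of what that registration buys you: tf.unique can sit under an explicit GPU device scope without a placement error, even though the computation itself happens on the CPU via host memory:

```python
import tensorflow as tf

# tf.unique placed under a GPU scope: legal only because a kernel is
# registered for the op, but per the above the work still happens on the CPU.
with tf.device("/gpu:0"):
    values, idx = tf.unique(tf.constant([1, 2, 1, 3]))

# log_device_placement logs where each op was actually assigned.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run([values, idx]))
```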
@zhangyaobit @yzhwang since we discussed this. @KashiErez: One useful clarification would be how many indices are going into the UniqueOp, which should be equal to the number of embeddings that are accessed in each iteration (e.g. sentence length).
Also @ekelsen, who has been thinking about using CUB as a way to implement these kinds of ops on the GPU.
Yeah, hopefully CUB will be usable from TF soon. In that case, unique can be done by sorting and then doing run-length-encoding. |
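To make that concrete, here's a minimal NumPy sketch of unique-via-sort-plus-run-length-encoding (illustrative only; a real GPU kernel would use CUB's device-wide sort and run-length-encode primitives, and note that this returns the unique values in sorted order, whereas tf.unique preserves first-occurrence order):

```python
import numpy as np

def unique_via_sort_rle(indices):
    # Sort the indices (on the GPU this would be a CUB radix sort).
    order = np.argsort(indices, kind="stable")
    sorted_idx = indices[order]
    # Run-length encode: a new run starts wherever the sorted value changes.
    is_run_start = np.concatenate(([True], sorted_idx[1:] != sorted_idx[:-1]))
    unique_values = sorted_idx[is_run_start]
    # Map each original position back to the id of its run, matching the
    # second output of tf.unique.
    run_ids = np.cumsum(is_run_start) - 1
    inverse = np.empty(len(indices), dtype=np.int64)
    inverse[order] = run_ids
    return unique_values, inverse

values, idx = unique_via_sort_rle(np.array([4, 7, 4, 9, 7]))
# values -> [4 7 9], idx -> [0 1 0 2 1]
```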
Regarding the question "how many indices are going into the UniqueOp, which should be equal to the number of embeddings that are accessed in each iteration (e.g. sentence length)": the batch size is 1024.
Regarding words: we have another categorical feature that has only one value in each iteration. From the trace you can see that this categorical feature's unique op runs much slower than the word-embedding unique op. So I think the parameter to look at first is the number of unique values (== lookup table size).
In that case, could you add a print node to get the exact shape of the Tensor going into unique? The whole idea of this code path is that the gradients are sparse; the IndexedSlices from the gradient of the embedding lookup has a number of indices equal to the number of embeddings which were actually accessed (which it sounds like should be ~21504 and ~1024?), independent of the size of the embedding lookup table. |
Hi, I'm not sure I understand where to add the print node.
Hi, This is the optimizer I used to print the indices shape:
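(The snippet itself isn't reproduced above; below is a minimal sketch of such an optimizer, using the TF 1.x API and an illustrative class name, which prints the shape of the indices tensor feeding the unique/dedup path.)

```python
import tensorflow as tf

class IndicesShapePrintingAdam(tf.train.AdamOptimizer):
    def _apply_sparse_duplicate_indices(self, grad, var):
        # tf.Print passes grad.indices through unchanged while logging its
        # shape, i.e. the number of embedding rows touched in this step.
        indices = tf.Print(grad.indices, [tf.shape(grad.indices)],
                           message="indices shape for %s: " % var.op.name)
        grad = tf.IndexedSlices(grad.values, indices, grad.dense_shape)
        return super(IndicesShapePrintingAdam,
                     self)._apply_sparse_duplicate_indices(grad, var)
```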
I ran it on 3 models; here are the results. Model 1 - a few small categorical features (no more than 10K unique values each):
Model 2 - small categorical features (no more than 10K each) + words (50K unique values):
Model 3 - small categorical features (no more than 10K each) + words (50K unique values) + a big categorical feature (300K unique values):
Summing up the indices sizes: feature cardinality and indices size are correlated. A small note regarding the gradient scopes (my_optimizer_0/gradients/concat*): my_optimizer_0 is my code.
Well, that explains why the transfer times are in milliseconds: 3 megabytes of indices at ~1 gigabyte/sec is roughly 3 ms each way, and they take a round trip (even if they're all duplicates, the second result of unique() has the same size as its input). There's a separate question of why that many embeddings are accessed at each iteration (all of them?). It's definitely unusual for a language model. Regardless, it would clearly be nice to have a GPU kernel for unique.
I ran into this problem too. Has it been solved?
@ningyuwhut I'd suggest using the workaround @KashiErez mentioned for now if you don't care about deduplicating sparse gradients and want the previous behavior. This was a bug fix, so we can't just go back to the old behavior by default. AFAIK nobody is working on a GPU kernel for UniqueOp, but that still seems like the resolution here if you're interested in taking the bug. |
Hi @KashiErez! Sorry for the late response.
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
Your email has been received. --阿杰
Closing as stale. Please reopen if you'd like to work on this further. |
Hi,
I ran into this performance issue while trying to upgrade TensorFlow from version 0.12.1 to 1.x.
We ran a network with large embedding lookup tables:
After upgrading to TF 1.0.1, GPU usage dropped from 60% to 30%.
Training time went up by 50%-200% (depending on how big the embedding lookup table is).
This is the commit that caused the performance degradation:
f9f56f9
The handling of unique indices is very slow and does not run in parallel with other operations.
Please note the big unique blocks in the middle.
Here is a workaround (which skips the unique-index handling):
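(The snippet isn't reproduced above; this is a minimal sketch of the kind of workaround described, assuming the TF 1.x optimizer internals. It is only safe if no minibatch ever produces duplicate indices, since Adam's sparse update is not well-defined for duplicates.)

```python
import tensorflow as tf

class NoDedupAdamOptimizer(tf.train.AdamOptimizer):
    # Bypass the unique()-based deduplication introduced in the commit above
    # and apply the sparse gradient directly, as pre-1.0 TF did.
    def _apply_sparse_duplicate_indices(self, grad, var):
        return self._apply_sparse(grad, var)
```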
Thanks,
Erez