Momentum SGD is very slow with large embedding layer #14075
I'm assuming you mean that AdamOptimizer is slower than MomentumOptimizer; Adam applies momentum to the whole variable, while Momentum only applies it to the slices affected. Does tf.contrib.opt.LazyAdamOptimizer work for you?
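For reference, a minimal sketch of what that swap looks like in TF 1.x (the `loss` tensor here is a stand-in for whatever the model defines):

```python
import tensorflow as tf

# LazyAdamOptimizer only updates the Adam moment accumulators for the rows
# that actually appear in the sparse gradient, instead of the whole variable.
optimizer = tf.contrib.opt.LazyAdamOptimizer(learning_rate=1e-3)
train_op = optimizer.minimize(loss)  # `loss` is assumed to be defined elsewhere
```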
@allenlavoie Sorry, I meant that AdamOptimizer is faster than MomentumOptimizer in my case.
Can you post code to reproduce? It sounds like this is not a sparsity issue, since MomentumOptimizer handles sparse updates more quickly than Adam (in theory). Maybe a device placement issue?
And tf.contrib.opt.LazyAdamOptimizer does speed up training; it really helps.
I'd start with log_device_placement; maybe sparse Momentum updates don't have a GPU kernel (or don't have one for the dtype you're using)?
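A quick sketch of turning that logging on in TF 1.x (`train_op` is just a placeholder for the model's training op):

```python
import tensorflow as tf

# Print, for every op, which device (CPU or GPU) it was assigned to.
# Look for the sparse Momentum update op ending up on the CPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)  # placeholder for the model's training op
```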
Oh, right, Adam's sparse updates are in terms of other ops rather than being fused. It looks like none of the other optimizers have GPU kernels implemented for their sparse updates (they do have GPU kernels for dense updates). One approach would be to re-implement MomentumOptimizer's sparse updates in terms of the same ops Adam is using rather than the custom fused kernel; that may be difficult without also adopting Adam's dense-update behavior (which is more correct, but would upset lots of people). Alternatively, it may be possible to implement the fused sparse GPU kernel efficiently using CUB. Or you could make a simple copy of MomentumOptimizer which always does dense updates, since those will run on the GPU and be faster in your case. That last option is more of a hack, but contributions are certainly welcome for either of the other options.
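One way to approximate the "always dense" hack without copying the optimizer class is to densify the IndexedSlices gradient before applying it. A sketch, assuming `loss` is defined elsewhere and the memory cost of a full-size dense gradient is acceptable:

```python
import tensorflow as tf

opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
grads_and_vars = opt.compute_gradients(loss)  # `loss` assumed defined elsewhere

# Convert IndexedSlices gradients (produced by the embedding lookup) into
# dense tensors so the dense Momentum GPU kernel is used instead of the
# sparse update path.
dense_grads_and_vars = [
    (tf.convert_to_tensor(g) if isinstance(g, tf.IndexedSlices) else g, v)
    for g, v in grads_and_vars
]
train_op = opt.apply_gradients(dense_grads_and_vars)
```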
Thanks @allenlavoie ~ |
Hi, have you checked the new tf.keras.optimizers.experimental.SGD, which can be used with large embedding layers? Thanks!
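A minimal sketch of the suggested replacement in recent TF 2.x releases (the `model` object is assumed to be defined elsewhere):

```python
import tensorflow as tf

# Momentum SGD via the newer Keras optimizer API.
optimizer = tf.keras.optimizers.experimental.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=optimizer, loss="mse")  # `model` assumed defined elsewhere
```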
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
Imported from GitHub PR openxla/xla#14075

This PR adds the pattern `pow(A, 0.5) => sqrt(A), A >= 0`.

Validation - the following checks are valid both before and after the simplification:

```c
If x == 0 the result is 0
If x > 0 the result > 0
If x is inf the result is inf
If x is nan the result is nan
```

Copybara import of the project:

-- 214b16a491f8f257f2f15576e5e20d512696592d by Alexander Pivovarov <pivovaa@amazon.com>:

Add pow(A, 0.5) => sqrt(A), A >= 0 to algsimp

Merging this change closes #14075

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#14075 from apivovarov:powhalf_to_sqrt 214b16a491f8f257f2f15576e5e20d512696592d

PiperOrigin-RevId: 646070837
Hi, I have a 500,000-row by 256-column embedding layer in my model, and I found that Momentum SGD is 10x slower than the Adam optimizer. When I printed the global variables using tf.global_variables(), I found a large Momentum variable created by the optimizer. Does that mean Momentum SGD does not support sparse updates yet?
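For context, a rough TF 1.x sketch of the setup described above (variable names and the loss are illustrative only), showing the extra Momentum slot variable the optimizer creates:

```python
import tensorflow as tf

# A 500000 x 256 embedding trained with Momentum SGD.
embedding = tf.get_variable("embedding", shape=[500000, 256], dtype=tf.float32)
ids = tf.placeholder(tf.int64, shape=[None])
embedded = tf.nn.embedding_lookup(embedding, ids)  # gradient is an IndexedSlices
loss = tf.reduce_sum(embedded)                     # stand-in loss for illustration

opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
train_op = opt.minimize(loss)

# tf.global_variables() now also contains a Momentum slot variable with the
# same 500000 x 256 shape as the embedding itself.
print([v.name for v in tf.global_variables()])
```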