
Momentum SGD is very slow with large embedding layer #14075

Closed
TomorrowIsAnOtherDay opened this issue Oct 29, 2017 · 12 comments · Fixed by #70281
Labels
stale (marks the issue/PR stale, to be closed automatically if no activity), stat:awaiting response (awaiting response from author)

Comments

@TomorrowIsAnOtherDay

Hi, I have a large embedding layer (500,000 rows × 256 columns) in my model, and I found that momentum SGD is 10x slower than the Adam optimizer. When I printed the global variables using tf.global_variables(), I found a large Momentum variable created by the optimizer. Does that mean MomentumOptimizer does not support sparse updates yet?
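For reference, a minimal TF 1.x sketch of the setup being described; the shapes match the report, but the names and loss are illustrative, not the actual training code:

```python
import tensorflow as tf  # TF 1.x API

# Illustrative repro: a 500,000-row x 256-column embedding table.
vocab_size, embed_dim, batch_size = 500000, 256, 128

ids = tf.placeholder(tf.int32, shape=[batch_size])
embeddings = tf.get_variable("embeddings", shape=[vocab_size, embed_dim])
looked_up = tf.nn.embedding_lookup(embeddings, ids)  # gradient w.r.t. embeddings is sparse
loss = tf.reduce_mean(tf.square(looked_up))

# Switching between these two optimizers is the only change being compared.
# opt = tf.train.AdamOptimizer(learning_rate=1e-3)
opt = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
train_op = opt.minimize(loss)
```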

@TomorrowIsAnOtherDay TomorrowIsAnOtherDay changed the title Momentum SGD is very slow with big embedding layer Momentum SGD is very slow with large embedding layer Oct 30, 2017
@allenlavoie
Member

I'm assuming you mean that AdamOptimizer is slower than MomentumOptimizer; Adam applies momentum to the whole variable, while Momentum only applies it to the affected slices. Does tf.contrib.opt.LazyAdamOptimizer work for you?
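For context, swapping in LazyAdamOptimizer under TF 1.x is a one-line change; a minimal sketch (the learning rate and `loss` are illustrative):

```python
import tensorflow as tf

# LazyAdamOptimizer updates the Adam accumulators only for the embedding rows
# that appear in the current batch, rather than for the whole variable.
optimizer = tf.contrib.opt.LazyAdamOptimizer(learning_rate=1e-3)
train_op = optimizer.minimize(loss)  # `loss` is whatever the model defines
```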

@TomorrowIsAnOtherDay
Author

@allenlavoie Sorry, I meant that AdamOptimizer is faster than MomentumOptimizer. In my case, AdamOptimizer can run 11.58 batches per second, while MomentumOptimizer can only run 5 batches per second. I run TF 1.3.1 on a K40 GPU.

@allenlavoie
Member

Can you post code to reproduce? It sounds like this is not a sparsity issue, since MomentumOptimizer handles sparse updates more quickly than Adam (in theory). Maybe a device placement issue?

@TomorrowIsAnOtherDay
Author

TomorrowIsAnOtherDay commented Oct 30, 2017

The code is complex and has a lot of preprocessing (using the Dataset API and TF data), but I can write a test case for it. Could you give some advice on how to test whether it is a placement issue? I only changed the code from AdamOptimizer to MomentumOptimizer, so the way the optimizers are placed should be the same.

@TomorrowIsAnOtherDay
Author

And tf.contrib.opt.LazyAdamOptimizer does speed up training; it really helps.

@allenlavoie
Member

allenlavoie commented Oct 30, 2017

I'd start with log_device_placement; maybe sparse Momentum updates don't have a GPU kernel (or don't have one for the dtype you're using)?
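A minimal sketch of checking placement in TF 1.x; `train_op` is assumed to be the training op from the graph above:

```python
import tensorflow as tf

# Log the device every op is placed on; sparse Momentum update ops that fall
# back to the CPU will show up as /device:CPU:0 (or /cpu:0) placements.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)
```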

@TomorrowIsAnOtherDay
Author

TomorrowIsAnOtherDay commented Oct 30, 2017

[screenshot of device placement log]
It seems sparse Momentum updates don't have a GPU kernel?

@allenlavoie
Member

Oh, right, Adam's sparse updates are in terms of other ops rather than being fused. Looks like none of the other optimizers have GPU kernels implemented for the sparse updates (they do have GPU kernels for dense updates).

One approach would be to re-implement MomentumOptimizer's sparse updates in terms of the same ops Adam is using rather than using the custom fused kernel. May be difficult without also using Adam's dense update behavior (which is more correct, but would upset lots of people).

Or it may be possible to implement the sparse fused GPU kernel efficiently using CUB.

Or you could make a simple copy of MomentumOptimizer which always does dense updates, since those will run on the GPU and be faster in your case. This is more of a hack, but contributions are certainly welcome for either of the other options (a rough sketch of the hack follows below).
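A rough, untested sketch of that last option, assuming the private `_apply_sparse`/`_apply_dense` hooks of the TF 1.x Optimizer base class; note that densifying the gradient costs memory proportional to the full embedding table:

```python
import tensorflow as tf

class DenseMomentumOptimizer(tf.train.MomentumOptimizer):
    """Momentum variant that densifies sparse gradients so the fused dense
    GPU kernel is used instead of the sparse apply op."""

    def _apply_sparse(self, grad, var):
        # `grad` arrives as an IndexedSlices; convert it to a dense tensor
        # and reuse the dense update path, which has a GPU kernel.
        dense_grad = tf.convert_to_tensor(grad)
        return self._apply_dense(dense_grad, var)
```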

@allenlavoie allenlavoie added the stat:contribution welcome Status - Contributions welcome label Oct 30, 2017
@TomorrowIsAnOtherDay
Author

Thanks @allenlavoie ~

@sachinprasadhs sachinprasadhs self-assigned this Nov 23, 2022
@sachinprasadhs sachinprasadhs removed the stat:contribution welcome Status - Contributions welcome label Nov 23, 2022
@sachinprasadhs
Contributor

Hi, have you checked the new tf.keras.optimizers.experimental.SGD, which can be used with large embeddings? Thanks!
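For example, roughly as follows; this is a minimal sketch assuming a TF 2.x version where the experimental optimizer namespace exists (around TF 2.9-2.11), and the model shape is illustrative:

```python
import tensorflow as tf

# Illustrative model with a large embedding layer, trained with SGD + momentum.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=500000, output_dim=256),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.experimental.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=optimizer, loss="mse")
```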

@sachinprasadhs sachinprasadhs added the stat:awaiting response Status - Awaiting response from author label Nov 23, 2022
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Nov 30, 2022
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

copybara-service bot pushed a commit that referenced this issue Jun 24, 2024
Imported from GitHub PR openxla/xla#14075

This PR adds pattern `pow(A, 0.5) => sqrt(A), A >= 0`

Validation - the following checks are valid both before and after simplification:
```c
If x == 0 the result is 0
If x > 0 the result > 0
If x is inf the result is inf
If x is nan the result is nan
```
Copybara import of the project:

--
214b16a491f8f257f2f15576e5e20d512696592d by Alexander Pivovarov <pivovaa@amazon.com>:

Add pow(A, 0.5) => sqrt(A), A >= 0 to algsimp

Merging this change closes #14075

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#14075 from apivovarov:powhalf_to_sqrt 214b16a491f8f257f2f15576e5e20d512696592d
PiperOrigin-RevId: 646070837