Replacement for experimental_run_tf_function after removal from tf.keras.Model.compile #35138
Comments
@tgaddair I notice your PR is already merged. Do we still need to keep this open? Thanks!
Hey @jvishnuvardhan, thanks for the response. That PR was only to pin our integration tests to an older version of TensorFlow.
cc @martinwicke, @alsrgv was telling me you might have some thoughts on this as well. Thanks.

@tanzhenyu or @robieta would know more details.

Hey @tanzhenyu and @robieta, any thoughts on this? Currently we're unable to support TensorFlow 2.1 effectively without this. But hopefully there's a workaround.
@tgaddair In my case, …

@tgaddair I am currently using `experimental_run_tf_function=False` with TF 2.1.
@nbro I think you misunderstood me. I'm saying that this functionality will be dropped in TF 2.2, not 2.1. In the original issue, I assumed it would be in effect in 2.1 because the change was made before the 2.1 release, but the change was never merged into the 2.1 branch. Does that make sense? See this discussion for more details: #36398

@tgaddair Yes, now it makes sense.
The 2.2 release is on its way; I guess the solution from #37765 (edit: actually #36398) will work. Some analysis (still using 2.1): IMO the underlying problem seems to lie in the duality of the methods used.

Having multiple seemingly equivalent ways of doing the same thing always complicates (or even breaks) things. But to move forward: Horovod, for example, has a `DistributedGradientTape`. Question: why is there a `get_gradients` method on the Optimizer at all when `GradientTape` already exists?

@tgaddair Correct me if I'm mistaken, but I think a good way to solve this is to use the plain Optimizer (not the Horovod `DistributedOptimizer`) and tell TensorFlow to use Horovod's `DistributedGradientTape`, as in the sketch below. So my proposal:

- Short-term: allow users to pass an optional `GradientTape` to `model.compile`, or make the training loop query the optimizer to create one.
- Long-term: clarify why there are at least two ways to compute the gradients, decide on one, and support that properly. Obviously my call would be on `GradientTape`, to split responsibilities properly.
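A minimal sketch of that proposal in a custom training loop, assuming Horovod's documented `hvd.DistributedGradientTape` API; `model` and `loss_fn` are placeholder names, not part of either library:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Plain Keras optimizer -- no hvd.DistributedOptimizer wrapper involved.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())

@tf.function
def train_step(model, loss_fn, images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # The only Horovod-specific step: wrapping the tape makes
    # tape.gradient() allreduce the gradients across workers.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```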
I'm not familiar with all the issues here. But have you seen that `Model.train_step` can now be overridden? That lets you affect how the model calculates the gradients. Does that help this use case?
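For reference, a sketch of what such an override could look like once `Model.train_step` became overridable in TF 2.2; wiring Horovod's tape in this way is an illustration, not an officially supported pattern:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

class HorovodModel(tf.keras.Model):
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred)
        # Swap in Horovod's tape so gradients are allreduced across workers.
        tape = hvd.DistributedGradientTape(tape)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
```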
Not really, still too high-level. The goal is to only replace the way TF gathers gradients, not to change the whole training algorithm.
Hey @Flamefire, following #36398 Horovod will work correctly when calling `model.fit` in TF 2.2. But I agree that there is an API problem where we have two ways of doing the same thing, and that needs to be addressed. One solution that's been proposed is to add an explicit gradient-aggregation hook to the optimizer. The other solution is, as you suggest, to use a custom `GradientTape`. As to why …
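To illustrate the aggregation-hook idea: TF 2.4 eventually shipped a `gradient_aggregator` argument on `tf.keras` optimizers, and a Horovod-style aggregator could look roughly like this (the `hvd_aggregator` function is a sketch, not Horovod's actual implementation):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

def hvd_aggregator(grads_and_vars):
    # Allreduce each gradient across workers before the optimizer applies it.
    return [(hvd.allreduce(grad), var) for grad, var in grads_and_vars]

# The gradient_aggregator argument exists on tf.keras optimizers from TF 2.4 on.
opt = tf.keras.optimizers.SGD(0.01, gradient_aggregator=hvd_aggregator)
```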
I think you misunderstood some parts. `optimizer.get_gradients` exists in TF. It gets used in 2.1 when `experimental_run_tf_function=False` is passed to `compile`.
Hence my proposal to unify the whole training machinery in TF so there is exactly one way, and one place, where things are done. This can then be extended to provide customization points. Example: provide a custom `GradientTape`, which would eliminate the need for the `DistributedOptimizer`. I don't think the Optimizer should have a method for gathering gradients. An optimizer is a thing like SGD, Adam, ..., which defines how gradients are applied. The `LossScalingOptimizer` also makes sense. But the `DistributedOptimizer` is a hack, IMO.
Hey @Flamefire, I think we're on the same page here. Essentially, these different ways of doing the same thing need to be unified. The only question is whether it should be done via a custom gradient tape or custom hooks into the optimizer. Or, put differently, how much of the training loop should be managed as internals of the optimizer. I don't have a strong preference either way, so long as TensorFlow chooses to be consistent going forward (which has historically been an issue, as you've pointed out with the three different ways of doing the same thing). I like keeping `DistributedOptimizer` because it doesn't require Horovod users to relearn the API for TensorFlow 2, but that's secondary to consistency and consolidation.
What is the actual solution for this? I see the TF2 Keras example in the docs still uses `experimental_run_tf_function=False`.
Hey @relativeflux, the TF2 Keras example uses `experimental_run_tf_function=False` only for backwards compatibility with TF 2.0 and 2.1; in TF 2.2 the argument is ignored, and Horovod works without it following #36398.
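A rough sketch of the mechanism that #36398 relies on, assuming TF 2.2's private `_aggregate_gradients` hook on `OptimizerV2` (a private API, so the exact signature may differ between releases):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

class DistributedSGD(tf.keras.optimizers.SGD):
    def _aggregate_gradients(self, grads_and_vars):
        # Called by the TF 2.2 Keras training loop before loss-scale
        # unscaling and gradient clipping -- exactly where Horovod
        # needs to allreduce. Returns the aggregated gradients.
        return [hvd.allreduce(grad) for grad, _ in grads_and_vars]
```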
@tgaddair Excellent, that's good to know.

@tgaddair, …

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further. |
It looks like `experimental_run_tf_function` was removed from `tf.keras.Model.compile` in this commit a few days ago: c73c99c#diff-de9b96ac2d81503324cbbbe21732031fR1159

In Horovod, this flag / graph mode is necessary in order for `Optimizer.get_gradients()` to be called, which aggregates gradients across workers. Since this flag has been removed, distributed training in Horovod with `tf.keras` is not working in our nightly builds.

Is there a workaround to achieve the same behavior with the latest changes on master?

Note that we cannot perform the allreduce aggregation in `apply_gradients` due to interactions with gradient clipping and loss scaling (see horovod/horovod#1347).
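To make the ordering constraint concrete, here is a schematic training step with the required order of operations (illustrative names; this is not Horovod's actual implementation):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

def train_step(model, loss_fn, optimizer, x, y, clip_norm=1.0):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)    # 1. local gradients
    grads = [hvd.allreduce(g) for g in grads]                 # 2. aggregate across workers first
    grads = [tf.clip_by_norm(g, clip_norm) for g in grads]    # 3. then clip (or unscale)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # 4. finally apply
    return loss
```

If the allreduce instead happened inside `apply_gradients` (step 4), clipping at step 3 would act on each worker's local, unaggregated gradients, so workers could clip differently and their replicas would drift apart; loss scaling has the same problem.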