Learning Rate fine-tuning #24
Hi Pascal,
I'm afraid you will have to move past the retrain.py example, because it
really is designed around a frozen image module with potentially cached
bottleneck values. Even if not cached, the way they are passed into the
Session.run() call that does the training prevents backprop through the
module.
There is no publicly available example, but I can offer some general advice:
- For the training graph, use hub.Module(..., trainable=True,
tags={"train"}) to get the graph version that operates batch norm in
training mode. (In retrain.py, this would conflict with the use of
continuous eval in the training graph.) Also, trainable=True brings in
REGULARIZATION_LOSSES from the hidden layers (if any); use those when you
see overfitting.
- Use data augmentation. For that, retrain.py is a workable starting point.
- Fine-tuning the whole network benefits from more sophisticated approaches
than plain SGD, esp. the use of momentum and learning rate decay. As a rule
of thumb, I'd recommend starting from the training regime in the
architecture's original paper (referenced from the module documentation),
with the initial learning rate cut by 10. Be aware that many uses of
RMSProp for image models set epsilon=1.0 (which is huge, and attenuates
the AdaGrad-style scaling).
- For performance, consider using a GPU, and/or multiple machines. If you
want to try in-graph replication for multiple GPUs, be sure to reuse one
Module object across the different towers, in order to share variables.
Also, for performance, consider training only the upper layers of the
module (e.g., by filtering trainable variables according to scope names).
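Taken together, the points above can be sketched in TF1-style code. This is only an illustration, not code from this thread: the module handle, input size, class count, learning rate, and momentum are all placeholder assumptions.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Placeholder handle; any image feature-vector module works the same way.
MODULE_HANDLE = "https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1"

images = tf.placeholder(tf.float32, [None, 299, 299, 3])
labels = tf.placeholder(tf.int64, [None])

# trainable=True exposes the module's weights to the optimizer;
# tags={"train"} selects the graph version with batch norm in training mode.
module = hub.Module(MODULE_HANDLE, trainable=True, tags={"train"})
features = module(images)

logits = tf.layers.dense(features, units=10)  # 10 classes: an assumption
cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=labels,
                                                       logits=logits)

# trainable=True also populates REGULARIZATION_LOSSES from the module's
# hidden layers; add them to the loss when you see overfitting.
total_loss = cross_entropy + tf.losses.get_regularization_loss()

# Momentum instead of plain SGD, per the advice above; 0.001 is a guess.
optimizer = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.9)

# Run the batch-norm moving-average updates alongside each training step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(total_loss)
```

Training only the upper layers, as suggested above, would amount to passing a filtered `var_list` to `optimizer.minimize`.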
Happy coding/training!
Arno
On 13 April 2018 at 17:24, Pascal Gula wrote:
Hello,
I would like to experiment with TF Hub to retrain an image classifier.
The retrain example is a good starting point for that purpose.
In the example, you have the ability to fix a learning rate for the final
layer.
Also, using hub.Module(..., trainable=True), you can let the
pre-trained weights be updated.
My question is: which learning rate will be applied in that case
(inherited from the one specified for the final layer?), and how can I
change it, ideally to one different from the final layer's?
Thanks in advance!
Hello @arnoegw, I have a question about the usage of REGULARIZATION_LOSSES.
Hi @moono, yup, this looks right (for training; be sure not to set tags={'train'} during eval or inference). TensorFlow offers some syntactic sugar for getting the regularization losses:
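The "syntactic sugar" presumably refers to `tf.losses.get_regularization_loss()`, which sums the REGULARIZATION_LOSSES collection for you. A minimal sketch, assuming a task loss tensor named `cross_entropy` is already defined:

```python
# Sums everything in tf.GraphKeys.REGULARIZATION_LOSSES, including the
# terms contributed by hub.Module(..., trainable=True).
regularization_loss = tf.losses.get_regularization_loss()
total_loss = cross_entropy + regularization_loss
```

This is equivalent to `tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))`, just shorter and safe when the collection is empty.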
I was seeing much higher training accuracy than evaluation accuracy.
I am a colleague of @MeTaNoV. As suggested by his question, we are currently trying to fine-tune a pretrained Inception-v3 from TF Hub on our specific classification task. Our first goal (which we haven't yet achieved) is simply to reproduce the results previously obtained with the Caffe framework. Following your response, we implemented a train graph that instantiates the TF module with trainable=True and tags={'train'}.

An additional important problem was related to the batch normalization layers. In order to work correctly at test time, the moving averages of the batch means and variances need to be updated during training. It seems that these updates are not done by default, which requires either running the update ops manually or including them in a control dependency; we implemented the latter to perform the updates automatically.

Even with this implementation, we still don't manage to reproduce our previous results with Caffe. Today I implemented fetching the variables from the batch normalization layers and writing their histograms to summaries for TensorBoard visualization. The visualization shows that the moving averages are indeed updated during training, but it also shows that the beta variables seem to be fixed throughout training. I understand that the gamma variables are not present, since they are redundant with the next convolutional layers in the case of ReLU activations. However, I would expect the beta variables to be very important before a ReLU activation, and I would expect the normalization effect of batch normalization combined with non-trainable beta variables to be very detrimental (from our tests, it seems we lose ~4% in our final top-1 accuracy). Is this analysis correct? Would you have a fix for this? Thank you very much in advance.
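A common TF1 pattern for performing these updates automatically (a sketch of the usual approach, not necessarily the exact code used here; `optimizer` and `total_loss` are assumed to be defined already):

```python
# Collect the batch-norm moving-average update ops created by the module...
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

# ...and make the training step depend on them, so that every optimizer
# step also refreshes the moving means and variances used at test time.
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(total_loss)
```

The key detail is that the dependency is placed on the training step itself, not on the loss tensor.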
Hi Antoine, what you describe is a general TensorFlow subtlety about running UPDATE_OPS. As far as I know, it's all the same whether they come out of a TensorFlow Hub module or directly from Python code using batch normalization. Usually, training is done with a train_op that combines the gradient updates from the optimizer with the elements of the UPDATE_OPS collection. The helper function tf.contrib.training.create_train_op does that by returning a train op whose optimizer step carries control dependencies on the update ops. I recommend doing something similar in your code. Just putting a control dependency on the loss does not automatically put a control dependency on its gradient; cf. the final example snippet in the API docs for tf.Graph.control_dependencies. I agree that failing to run UPDATE_OPS, so that the moving averages batch norm uses for inference fall out of sync with the per-batch statistics seen during training (or fine-tuning), will likely cause a serious degradation of quality. Hope that helps!
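A sketch of the `create_train_op` route (TF1 with `tf.contrib` available; `optimizer` and `total_loss` assumed defined), including the pitfall mentioned above:

```python
# create_train_op wires the UPDATE_OPS control dependencies into the
# training step for you, alongside the gradient updates.
train_op = tf.contrib.training.create_train_op(total_loss, optimizer)

# Anti-pattern, for contrast: this attaches the dependency to the loss
# tensor only. Computing gradients of total_loss does NOT inherit the
# dependency, so the update ops may never run.
# with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
#     total_loss = tf.identity(total_loss)
```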
Thank you very much Arno for the quick answer. I don't think there's a problem with our moving-average updates, since we can now visualize these variables evolving during training. What concerned me was the beta variables, which didn't seem to be updated. However, I managed to spot slight variations of beta (probably slight only because of how small we set the learning rate in the module part) in my latest visualizations: https://screenshots.firefox.com/Gc0s298lAiaIFpIP/localhost This means that these layers are correctly trained. Thus we are still wondering why we can't reproduce the results we obtained with Caffe.
Hi Antoine, I'm glad you could clear up the UPDATE_OPS issue. Are you still seeing a difference from Caffe? That's such a wide question, it's hard for me to answer. The REGULARIZATION_LOSSES of the module were already discussed upthread. There might also be differences in regularizing the classifier you put on top (dropout? weight decay?), data augmentation, the optimizer and its learning rate schedule, Polyak averaging of the model weights, ...
Hello @alabatie, can you share your findings, please?
If I am using a hub module with tags={'train'} for fine-tuning, how can I later use the trained weights for inference?
Hello @himaprasoonpt, please see this StackOverflow answer. (In short: solutions differ for TF1 and TF2. In TF1, you'd need to checkpoint the weights and restore them into a new graph built with switched tags.)
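A TF1-style sketch of that checkpoint-and-restore dance; the module handle, input size, and checkpoint path are placeholders. This relies on the train-tagged and default graph versions of a module sharing the same variable names, so a plain `Saver` can move weights between them:

```python
import tensorflow as tf
import tensorflow_hub as hub

MODULE_HANDLE = "https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1"  # placeholder
CKPT = "/tmp/finetuned.ckpt"  # placeholder path

# 1) Training graph: batch norm in training mode, variables trainable.
train_graph = tf.Graph()
with train_graph.as_default():
    images = tf.placeholder(tf.float32, [None, 299, 299, 3])
    features = hub.Module(MODULE_HANDLE, trainable=True, tags={"train"})(images)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ... run fine-tuning steps here ...
        saver.save(sess, CKPT)

# 2) Inference graph: same module with default (inference) tags,
#    fine-tuned weights restored from the checkpoint.
infer_graph = tf.Graph()
with infer_graph.as_default():
    images = tf.placeholder(tf.float32, [None, 299, 299, 3])
    features = hub.Module(MODULE_HANDLE)(images)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, CKPT)
        # ... run inference via sess.run(features, feed_dict=...) ...
```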