
Learning Rate fine-tuning #24

Closed
MeTaNoV opened this issue Apr 13, 2018 · 11 comments


MeTaNoV commented Apr 13, 2018

Hello,

I would like to experiment with TF Hub to retrain an image classifier.
The retrain example is a good starting point for that purpose.
In the example, you have the ability to fix a learning rate for the final layer.
Also, using hub.Module(..., trainable=True), you can let the pre-trained weights be updated as well.
My question is: which learning rate is applied to the module's weights in that case (is it inherited from the one specified for the final layer?), and, if it is possible, how can I change it so that it differs from the final layer's learning rate?

Thanks in advance!


arnoegw commented Apr 13, 2018 via email


moono commented May 18, 2018

Hello, @arnoegw,

I have a question about the usage of REGULARIZATION_LOSSES.
Is it like the following? Am I on the right track?

...

module = hub.Module('...', trainable=True, tags={'train'})
start_from = module(inputs)
logits = tf.layers.dense(start_from, units=n_output_class, activation=None)
...

# add the module's regularization losses on top of the task loss
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss += tf.add_n(reg_losses)


arnoegw commented May 18, 2018

Hi @moono

Yup, this looks right (for training -- be sure not to set tags={'train'} during eval or inference).

TensorFlow offers some syntactic sugar for getting the regularization losses:
tf.losses.get_regularization_losses() for the list,
tf.losses.get_regularization_loss() for its sum.
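
For example, the sum variant can replace the manual tf.add_n in the snippet above (a minimal TF1 sketch, assuming labels and logits are defined as before):

# task loss plus every regularization term collected in the graph, in one call
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
loss += tf.losses.get_regularization_loss()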


moono commented May 18, 2018

@arnoegw

I was seeing much higher training accuracy than evaluation accuracy.
But as you mentioned, removing tags={"train"} at evaluation time fixed the issue.
Thank you so much :)

@alabatie

Hello @arnoegw and @moono,

I am a colleague of @MeTaNoV. As his question suggests, we are currently trying to fine-tune a pretrained Inception-v3 from TF Hub on our specific classification task. Our first goal, which we haven't achieved yet, is simply to reproduce the results previously obtained with the Caffe framework.

Following your response, we implemented a train graph that instantiates the TF Hub module with hub.Module(module_spec, trainable=True, tags={"train"}) and a test graph that instantiates it with hub.Module(module_spec). As in our Caffe implementation, we reduce the learning rate for the convolutional layers by a factor of 10 compared to the final classification layer, using the following trick just before the classification layer:

cnn_output_tensor = 1/10. * cnn_output_tensor + (1 - 1/10.) * tf.stop_gradient(cnn_output_tensor)
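
For readers unfamiliar with the trick: the forward value is unchanged, since s * x + (1 - s) * x = x, but the gradient only flows through the first term, so every layer upstream of this point sees its gradients scaled by s. A generalized sketch (the helper name is ours, not from the thread):

def scale_gradient(x, scale):
    # identity in the forward pass; multiplies the backward gradient by `scale`
    return scale * x + (1.0 - scale) * tf.stop_gradient(x)

cnn_output_tensor = scale_gradient(cnn_output_tensor, 0.1)  # module layers learn 10x slower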

An additional important problem was related to the batch normalization layers. For the model to work correctly at test time, the moving averages of the batch means and variances need to be updated during training. It seems that these updates are not performed by default, which requires either running the update ops manually or including them in a control dependency. Here is what we implemented to perform the updates automatically:
from tensorflow.python.ops import control_flow_ops

# collect the batch-norm moving-average update ops and force them to run
# whenever cross_entropy is evaluated
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
updates = tf.group(*update_ops)
cross_entropy = control_flow_ops.with_dependencies([updates], cross_entropy)

Even with this implementation, we still don't manage to reproduce our previous results with Caffe.

Today I implemented fetching the variables of the batch normalization layers and writing their histograms to summaries for TensorBoard visualization. The visualization shows that the moving averages are indeed updated during training, but it also shows that the beta variables seem to stay fixed throughout training.
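
(For reference, one way to collect such histograms -- a sketch, assuming the module's batch-norm variables have "BatchNorm" in their names, which depends on the module:)

# write a histogram summary for every batch-norm variable (moving means/variances, beta)
for var in tf.global_variables():
    if 'BatchNorm' in var.op.name:
        tf.summary.histogram(var.op.name, var)
merged_summaries = tf.summary.merge_all()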

I understand that the gamma variables are not present, since they are redundant with the next convolutional layer in the case of a ReLU activation. However, I would expect the beta variables to be very important before a ReLU activation, and I would expect the normalization effect of batch normalization combined with non-trainable beta variables to be quite detrimental (from our tests, it seems we lose ~4% in final top-1 accuracy). Is this analysis correct? Would you have a fix for this?

Thank you very much in advance.
Antoine


arnoegw commented May 25, 2018

Hi Antoine,

What you describe is a general TensorFlow subtlety about running UPDATE_OPS. As far as I know, it's all the same whether they come out of a TensorFlow Hub module or directly from Python code using batch normalization.

Usually, training is done with a train_op that combines the gradient updates from the optimizer with the elements of the UPDATE_OPS collection. The helper function tf.contrib.training.create_train_op does that by returning a train_op that is the total_loss with control_dependencies on both the update_ops and the grad_updates.

I recommend doing something similar in your code.

Just putting a control dependency on the loss does not automatically put a control dependency on its gradient; cf. the final example snippet in the API docs for tf.Graph.control_dependencies.
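
A minimal TF1 sketch of that pattern (assuming cross_entropy is the loss from above; the optimizer choice and learning rate are just illustrative):

# run the batch-norm moving-average updates together with each gradient step
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(cross_entropy)

# or, equivalently, let the helper wire it up:
# train_op = tf.contrib.training.create_train_op(cross_entropy, optimizer)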

I agree that not running UPDATE_OPS, and thereby letting the moving averages that batch norm uses for inference fall out of sync with the per-batch statistics seen during training (or fine-tuning), will likely cause a serious degradation of quality.

Hope that helps,
Arno

@alabatie

Thank you very much Arno for the quick answer.

I don't think there is a problem with our moving-average updates, since we can now see these variables evolving during training.

What concerned me was the beta variables, which didn't seem to be updated. However, I managed to spot slight variations of beta (probably only slight because of how small we set the learning rate in the module part) in my latest visualizations: https://screenshots.firefox.com/Gc0s298lAiaIFpIP/localhost

This means that these layers are correctly trained. Thus we are still wondering why we can't reproduce the results we obtained with Caffe.


arnoegw commented May 29, 2018

Hi Antoine, I'm glad you could clear up the UPDATE_OPS issue (if you evaluate cross_entropy at every step for loss reporting, your code will work, although the updates are triggered by that evaluation and not by backprop alone), and also the training of beta (batch norm's learned output mean).

Are you still seeing a difference from Caffe? That's such a broad question that it's hard for me to answer. The module's REGULARIZATION_LOSSES were already discussed upthread. There might also be differences in how you regularize the classifier you put on top (dropout? weight decay?), in data augmentation, in the optimizer and its learning rate schedule, in Polyak averaging of the model weights, ...


rsethur commented Jun 23, 2018

Hello @alabatie, can you share your findings, please?
I'm using TF Hub as well and would appreciate hearing what you learned.


himaprasoonpt commented Aug 7, 2019

If I am using a hub module as follows:
module = hub.Module('...', trainable=True, tags={'train'})
module_out = module(input)
layer2 = somelayer(module_out)
# ... define losses and optimizer ...
After training is complete, when I run layer2 (the final layer) in inference mode, should I change the module tags? If yes, how can I do that? Should I be using some sort of placeholder to switch tags? The batch-norm mode has to be changed, right?
@arnoegw


arnoegw commented Aug 7, 2019

Hello @himaprasoonpt, please see this StackOverflow answer. (In short: the solutions differ for TF1 and TF2. In TF1, you'd need to checkpoint the weights and restore them into a new graph built with switched tags.)
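
A rough TF1 sketch of that flow (the checkpoint path and the layers on top are illustrative, and both graphs must define matching variable names for the restore to work):

# training graph: module built with the 'train' tag
with tf.Graph().as_default():
    module = hub.Module('...', trainable=True, tags={'train'})
    # ... build layer2, losses, optimizer on top, train, then:
    # tf.train.Saver().save(sess, '/tmp/finetuned.ckpt')

# inference graph: same model rebuilt without the 'train' tag
with tf.Graph().as_default():
    module = hub.Module('...')  # inference mode: batch norm uses its moving averages
    # ... rebuild the same layers on top, then:
    # tf.train.Saver().restore(sess, '/tmp/finetuned.ckpt')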
