Improved optimization and implementation of adaptive learning rate algorithms (Adagrad, RMSProp) #106
Conversation
…accuupdate call in affine-trans-layer, fixed initialization of temporary ada variables in uni-lstm
I've also noticed that affine-trans-layer didn't do gradient clipping yet, which resulted in very large accumulators and eventually NaNs in that layer. I suspect this can also be a problem with SGD.
I've added RMSProp, which addresses Adagrad's weakness of an overly aggressive learning rate decay. A good default learning rate for RMSProp is 0.001.
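For reference, the standard textbook forms of the two updates (the exact implementation in this PR, e.g. the epsilon placement or decay constant, may differ in detail) are, with gradient g_t, accumulator r_t and learning rate eta:

```latex
% Adagrad: the squared-gradient accumulator r_t only ever grows,
% so the effective step size \eta / \sqrt{r_t} decays monotonically.
r_t = r_{t-1} + g_t^2, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon}\, g_t

% RMSProp: an exponential moving average (typically \rho = 0.9) instead,
% which keeps the effective step size from shrinking towards zero.
r_t = \rho\, r_{t-1} + (1 - \rho)\, g_t^2, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon}\, g_t
```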
Currently testing this on Haitian data (sort of our default small-ish test set from the LORELEI low-resource language project) with the suggested learn_rate 0.01; though it builds and runs OK, it looks a little bit unstable. For comparison, we are trying the same thing with our default settings (learn-rate 4e-5), probably as close to an apples-to-apples comparison as we can get given the different optimization algorithms.
Thanks for trying it out! And your output doesn't look quite as it should...
(1) What optimization are you using, RMSProp or Adagrad? (I'd recommend RMSProp.)
(2) How large is the data? From the timings I assume it's fairly small; I've been mainly testing with 120h+ corpora. I suspect that if it's a small corpus, you would reset the accumulators much quicker than I would. Note that resetting them between epochs is not how you would typically do Adagrad/RMSProp, and I'd say that saving the accumulators between epochs should be implemented to make the implementation complete. That should also help in cases where you don't have a lot of training data. Would saving the accumulators in the intermediate model files be OK, code-wise?
(3) The gradient statistics from the end-of-epoch log files could be helpful to troubleshoot this, could you post them?
Thanks!
I've looked into the parameters I've used for RMSProp on our German corpus (recommended learning rate values are different for RMSProp and Adagrad), and this is what I have: learn_rate=0.001. Let me know if that already helps convergence on your data.
(1) I was using Adagrad, not having caught up on the rest of this thread, where RMSProp is added later ('currently only Adagrad...').
Just curious: what size is your German data? And among the different algorithms, which should perform better (I guess in terms of WER)? The intuition is that RMSProp is better than Adagrad, which is better than plain gradient descent. And someone here just said "try Adam", which I guess is even newer and better. :)
But first, RMSProp looks great. In general, my testing of this pull request is complete; it builds, it runs, no regression errors as far as I can tell.
We'll wait a little longer to see relative WERs across the different algorithms/strategies, but I think this pull request is merge-worthy.
Eric, what is the baseline for this, please?
Still computing using 'best guess' learning rate optimization parameters for 3 setups on the same data: SGD, Adagrad, RMSProp. Soon!
Best WERs comparison for our Haitian test set

SGD:
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err | NCE |
score_p-0.5_10/dev10h.ctm.filt.sys:251: | Sum/Avg | 21530 95893 | 61.0 30.5 8.5 8.5 47.5 32.5 | -0.999 |

RMSProp:
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err | NCE |
score_p0.5_7/dev10h.ctm.filt.sys:251: | Sum/Avg | 21530 95893 | 58.2 28.6 13.2 5.3 47.0 32.5 | -0.908 |

Adagrad:
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err | NCE |
score_p1.0_9/dev10h.ctm.filt.sys: | Sum/Avg | 21530 95893 | 49.4 35.2 15.3 4.9 55.5 33.6 | -0.529 |

See attached: training statistics (train/validation accuracy by epoch): TrainingAlgorithms.xlsx <https://github.com/srvk/eesen/files/743368/TrainingAlgorithms.xlsx>
Interesting - the best LM weights seem very different, as do the insertions/deletions.
How are the training times? Comparable?
(what a great thing to have the training scripts print this out!)
Interesting, thanks for the evaluation! My observations, based on the spreadsheet:
Since convergence is faster in the beginning with RMSProp, you can usually set "halving-after-epoch" and "min-iters / max-iters" to much lower values compared to SGD. I see 25 epochs for both SGD and RMSProp, so I guess min-iters was set to 25? Hopefully I'll find some time soon to implement saving the accumulator state between epochs; that should make "halving-after-epoch" unnecessary. As for the differences in insertions/deletions, my guess is that these could also fluctuate somewhat across multiple runs with the same optimizer.
I've just added state saving for RMSProp. This is backwards compatible, i.e. you can still load older model files. Also, the accumulators are only saved if you train with RMSProp or Adagrad. Could you run your evaluation again with my newer version? I can also send you some results on our data next week with the version I've just pushed. So far we seem to get better results with RMSProp than with SGD+momentum on our data, but I haven't compared it with learning rate annealing yet. I can also report that setting the RMSProp gradient clipping much smaller than for SGD isn't necessary anymore; it worked just fine with 40. That this parameter was problematic in the beginning was probably due to the exploding accumulators in the affine layers, because they didn't do any clipping at all. Also, I haven't tested Adagrad with the state saving yet.
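Not the PR's actual I/O code, but a rough sketch of how backwards-compatible state saving of this kind can work: the accumulator is written behind an optional marker token, so older model files that simply end without it still load. The marker name, types, and format below are hypothetical.

```cpp
// Hypothetical sketch of backwards-compatible state saving (illustrative only).
#include <iostream>
#include <string>
#include <vector>

struct LayerState {
  std::vector<float> weights;
  std::vector<float> ada_accu;   // RMSProp/Adagrad accumulator (may be empty)
};

void Write(std::ostream &os, const LayerState &s, bool save_accu) {
  os << "<Weights> " << s.weights.size() << " ";
  for (float w : s.weights) os << w << " ";
  if (save_accu) {                        // only when training with RMSProp/Adagrad
    os << "<AdaAccu> " << s.ada_accu.size() << " ";
    for (float a : s.ada_accu) os << a << " ";
  }
}

void Read(std::istream &is, LayerState *s) {
  std::string tok;
  size_t n;
  is >> tok >> n;                         // "<Weights>"
  s->weights.resize(n);
  for (float &w : s->weights) is >> w;
  if (is >> tok && tok == "<AdaAccu>") {  // optional: absent in older model files
    is >> n;
    s->ada_accu.resize(n);
    for (float &a : s->ada_accu) is >> a;
  }
}
```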
Of course we can test this out, sorry for not replying sooner (hectic, bla bla). I have some time on an outside compute cluster (ours is currently down) to give this a try and will let you know.
Please find attached a chart with training accuracy graphs for baseline, Adagrad, and RMSProp, from the latest code in this branch. I'm not sure why Adagrad took so long to start moving.
Thanks, and ditto sorry for not replying sooner! I'll also send you an off-list reply later this week with results on our German corpus, where we see gains in convergence speed and also WER with RMSProp. I've mostly used RMSProp for my experiments now, with no decreasing (initial) learning rate. From the data you've sent me, I guess that the initial learning rate is too high for Adagrad, but if it proves too tricky to use, I'd suggest disabling it in this pull request.
Looks pretty good to me
This adds the possibility to train with an adaptive learning rate algorithm. Currently only Adagrad is implemented, since it's the simplest of them, but RMSProp is very similar and can easily be added on top of this. In the current version, a switch is provided to train-ctc-parallel:
po.Register("opt-algorithm", &opt, "Optimization algorithm (SGD|Adagrad)");
so that "--opt-algorithm Adagrad" will run the training with Adagrad.
Even though Adagrad has a very simple formula, a lot of changes were necessary:
(+) I added some missing cudamatrix and cudavector functions to make this possible, but similar functions need to be added for the CPU-only version. With the exception of the sqrt function, these are from Kaldi.
(+) Since each layer has its own update function, all layer types needed to be updated with the Adagrad update code. To keep things manageable, helper functions have been added to the base class trainable_layer so that all layer types share the update code. For every variable this is called separately, as with SGD. One accumulator and one temporary variable are added for every gradient matrix and vector in every layer; these are only resized / initialized if Adagrad is chosen as the optimization algorithm and shouldn't take extra space on the GPU when run with SGD (see the sketch below).
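To make the shared-helper idea concrete, here is a minimal plain-C++ sketch of such a per-variable Adagrad helper with lazy accumulator initialization. The real code operates on CuMatrix/CuVector objects on the GPU; the names and signatures below are illustrative only.

```cpp
// Illustrative per-variable Adagrad helper in a shared base class.
#include <cmath>
#include <vector>

class TrainableLayerBase {
 protected:
  // Called once per weight matrix / bias vector, mirroring the SGD update path.
  void AdagradUpdate(const std::vector<float> &grad,
                     std::vector<float> *accu,
                     std::vector<float> *param,
                     float learn_rate, float epsilon = 1e-8f) {
    if (accu->size() != grad.size())
      accu->assign(grad.size(), 0.0f);     // lazy init: no extra memory under SGD
    for (size_t i = 0; i < grad.size(); ++i) {
      (*accu)[i] += grad[i] * grad[i];                         // accumulate g^2
      (*param)[i] -= learn_rate * grad[i] /
                     (std::sqrt((*accu)[i]) + epsilon);        // scaled step
    }
  }
};
```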
I am currently testing with the following parameters (note that Adagrad needs a higher initial learning rate, is run without momentum, and I found it helpful to clip gradients more aggressively; see the clipping sketch after the parameter list):
learn_rate 0.01
momentum 0.0
maxgrad 4.0
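As an aside on maxgrad: below is a minimal sketch of element-wise gradient clipping at ±maxgrad, assuming element-wise clipping (the exact clipping scheme used in eesen may differ).

```cpp
// Illustrative element-wise gradient clipping at +/- maxgrad.
#include <algorithm>
#include <vector>

void ClipGradient(std::vector<float> *grad, float maxgrad /* e.g. 4.0 */) {
  for (float &g : *grad)
    g = std::max(-maxgrad, std::min(maxgrad, g));
}
```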
This seems to work for me, but needs more testing plus a side-by-side WER comparison with SGD. It would also be interesting to see whether the final result is somewhat more robust to the choice of the initial learning rate.
Limitations of the current state of the implementation:
(+) A shortcoming is that the accumulator is not saved between epochs, resulting in a temporary drop in accuracy at the beginning of each epoch until the accumulator is re-estimated. This also means the learning rate annealing schedule is still necessary.
(+) Adagrad is GPU-only at the moment, but since a switch is provided to choose between SGD+momentum and Adagrad, it can still be configured to run on CPU with SGD.
(+) The switch is currently only added to train-ctc-parallel and I've only tested it with the bilstm layers so far.
Would be really cool if someone could try this out! Hopefully all the necessary changes are in this PR. If not, let me know.