
Improved optimization and implementation of adaptive learning rate algorithms (Adagrad, RMSProp) #106

Merged (18 commits) on Jun 24, 2017

Conversation

@bmilde commented Nov 3, 2016

This adds the option to train with an adaptive learning rate algorithm. Currently only Adagrad is implemented, since it is the simplest of them, but RMSProp is very similar and can easily be added on top of it. In the current version, a switch is provided to train-ctc-parallel:

po.Register("opt-algorithm", &opt, "Optimization algorithm (SGD|Adagrad)");

so that "--opt-algorithm Adagrad" will run the training with Adagrad.

Even though Adagrad has a very simple formula, a lot of changes were necessary:

(+) I added some missing cudamatrix and cudavector functions to make this possible, but similar functions need to be added for the CPU-only version. With the exception of the sqrt function, these are from Kaldi.
(+) Since each layer has its own update function, all layer types needed to be updated with the Adagrad update code. To keep this manageable, helper functions have been added to the base class trainable_layer so that all layer types share the update code; as with SGD, they are called separately for every variable. One accumulator and one temporary variable are added for every gradient matrix and vector in every layer. These are only resized/initialized if Adagrad is chosen as the optimization algorithm, so they shouldn't take extra space on the GPU when running with SGD.
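
To make the shared helper concrete: per parameter, Adagrad accumulates the squared gradients and divides the step by their square root. A minimal sketch of what that update boils down to, on plain vectors (not the actual Eesen/CUDA code; eps and the names here are placeholders):

```cpp
#include <cmath>
#include <vector>

// Sketch of the Adagrad update on a flat parameter vector.
// accu accumulates squared gradients and persists across updates;
// each parameter gets its own effective learning rate lr / (sqrt(accu) + eps).
void AdagradUpdate(std::vector<float> &param,
                   const std::vector<float> &grad,
                   std::vector<float> &accu,   // same size as param
                   float lr, float eps = 1e-8f) {
  for (std::size_t i = 0; i < param.size(); ++i) {
    accu[i] += grad[i] * grad[i];
    param[i] -= lr * grad[i] / (std::sqrt(accu[i]) + eps);
  }
}
```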

I am currently testing with the following parameters (note that Adagrad needs a higher initial learning rate than SGD, is run without momentum, and I found it helpful to clip gradients more aggressively):

learn_rate 0.01
momentum 0.0
maxgrad 4.0

This seems to work for me, but it needs more testing plus a side-by-side WER comparison with SGD. It would also be interesting to see whether the final result is somewhat more robust to the choice of the initial learning rate.

Limitations of the current state of the implementation:
(+) One shortcoming is that the accumulator is not saved between epochs, resulting in a temporary drop in accuracy at the beginning of each epoch until the accumulator is re-estimated. This also means the learning rate annealing schedule is still necessary.
(+) Adagrad is GPU-only at the moment, but since a switch is provided to choose between SGD+momentum and Adagrad, it can still be configured to run on CPU with SGD.
(+) The switch is currently only added to train-ctc-parallel and I've only tested it with the bilstm layers so far.

It would be really cool if someone could try this out! Hopefully all the necessary changes are in this PR; if not, let me know.

@bmilde (Author) commented Nov 4, 2016

I've also noticed that affine-trans-layer didn't do gradient clipping yet, which resulted in very large accumulators and eventually NaNs in that layer. I suspect this can also be a problem with SGD.
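
For reference, the missing clipping amounts to clamping each gradient element to [-maxgrad, maxgrad] before the update; a minimal sketch of the idea (not the actual affine-trans-layer code):

```cpp
#include <algorithm>
#include <vector>

// Sketch: clamp every gradient element to [-max_grad, max_grad] before the
// update, so a few outlier gradients cannot blow up the Adagrad/RMSProp
// accumulators (or the weights themselves under SGD).
void ClipGradient(std::vector<float> &grad, float max_grad) {
  for (float &g : grad)
    g = std::max(-max_grad, std::min(max_grad, g));
}
```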

@bmilde (Author) commented Nov 10, 2016

I've added RMSProp, which addresses Adagrad's weakness of an overly aggressive learning rate decay. A good default learning rate for RMSProp is 0.001.
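
RMSProp replaces Adagrad's ever-growing sum of squared gradients with an exponential moving average, so the effective learning rate stops shrinking towards zero. A minimal sketch of the rule on plain vectors (not the actual Eesen/CUDA code; the decay constant rho and eps are assumptions here, the defaults in the PR may differ):

```cpp
#include <cmath>
#include <vector>

// Sketch of the RMSProp update: mean_sq is an exponential moving average of
// squared gradients instead of an unbounded sum, so the per-parameter step
// size does not decay towards zero over training.
void RMSPropUpdate(std::vector<float> &param,
                   const std::vector<float> &grad,
                   std::vector<float> &mean_sq,   // persistent running average
                   float lr = 0.001f, float rho = 0.9f, float eps = 1e-8f) {
  for (std::size_t i = 0; i < param.size(); ++i) {
    mean_sq[i] = rho * mean_sq[i] + (1.0f - rho) * grad[i] * grad[i];
    param[i] -= lr * grad[i] / (std::sqrt(mean_sq[i]) + eps);
  }
}
```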

@bmilde changed the title from "Implementation of an adaptive learning rate algorithm (Adagrad)" to "Improved optimization and implementation of adaptive learning rate algorithms (Adagrad, RMSProp)" on Nov 10, 2016
@riebling (Contributor) commented:

Currently testing this on Haitian data (sort of our default small-ish test set from the LORELEI low-resource language project) with the suggested

learn_rate 0.01
momentum 0.0
maxgrad 4.0

and though it builds and runs OK, it looks a little unstable*. For comparison, we are trying the same thing with our default settings, probably as close to an apples-to-apples comparison as we can get given the different optimization algorithms:

learn-rate 4e-5
momentum 0.9
maxgrad 4.0

output so far*

EPOCH 7 RUNNING ... ENDS [2017-Jan-23 20:59:34]: lrate 0.01, TRAIN ACCURACY 17.2297%, VALID ACCURACY 40.6902%
EPOCH 8 RUNNING ... ENDS [2017-Jan-23 22:10:28]: lrate 0.01, TRAIN ACCURACY 29.5814%, VALID ACCURACY 45.2823%
EPOCH 9 RUNNING ... ENDS [2017-Jan-23 23:21:40]: lrate 0.01, TRAIN ACCURACY 18.8273%, VALID ACCURACY 39.6867%
EPOCH 10 RUNNING ... ENDS [2017-Jan-24 00:33:00]: lrate 0.01, TRAIN ACCURACY 28.4833%, VALID ACCURACY 45.0673%
EPOCH 11 RUNNING ... ENDS [2017-Jan-24 01:44:48]: lrate 0.01, TRAIN ACCURACY 2.0878%, VALID ACCURACY 24.4542%
EPOCH 12 RUNNING ... ENDS [2017-Jan-24 02:56:36]: lrate 0.01, TRAIN ACCURACY 33.4453%, VALID ACCURACY 46.5732%
EPOCH 13 RUNNING ... ENDS [2017-Jan-24 04:07:22]: lrate 0.01, TRAIN ACCURACY 39.8068%, VALID ACCURACY 46.8349%
EPOCH 14 RUNNING ... ENDS [2017-Jan-24 05:17:46]: lrate 0.01, TRAIN ACCURACY 25.8570%, VALID ACCURACY 43.3929%
EPOCH 15 RUNNING ... ENDS [2017-Jan-24 06:28:16]: lrate 0.01, TRAIN ACCURACY 38.1701%, VALID ACCURACY 46.0704%
EPOCH 16 RUNNING ... ENDS [2017-Jan-24 07:38:41]: lrate 0.01, TRAIN ACCURACY 42.3554%, VALID ACCURACY 48.6366%
EPOCH 17 RUNNING ... ENDS [2017-Jan-24 08:49:04]: lrate 0.01, TRAIN ACCURACY 42.5339%, VALID ACCURACY 49.3602%
EPOCH 18 RUNNING ... ENDS [2017-Jan-24 10:00:03]: lrate 0.01, TRAIN ACCURACY 16.2077%, VALID ACCURACY 34.0114%
EPOCH 19 RUNNING ... ENDS [2017-Jan-24 11:09:53]: lrate 0.01, TRAIN ACCURACY 34.8846%, VALID ACCURACY 43.9012%
EPOCH 20 RUNNING ...

@bmilde (Author) commented Jan 25, 2017

Thanks for trying it out! And your output doesn't look quite as it should...

(1) Which optimization algorithm are you using, RMSProp or Adagrad? (I'd recommend RMSProp.)

(2) How large is the data? From the timings I assume it's fairly small; I've mainly been testing with 120h+ corpora. If it's a small corpus, the accumulators get reset much more often than in my setup. Note that resetting them between epochs is not how you would typically do Adagrad/RMSProp, and I'd say that saving the accumulators between epochs should be implemented to make the implementation complete. That should also help when you don't have a lot of training data. Would saving the accumulator in the intermediate model files be OK, code-wise?

(3) The gradient statistics at the end of the epoch log files could help troubleshoot this; could you post them? Thanks!

@bmilde (Author) commented Jan 26, 2017

I've looked into the parameters I've used for RMSProp on our German corpus (recommended learning rate values are different for RMSProp and Adagrad) and this is what I have:

learn_rate=0.001
momentum=0.0
opt_algorithm=RMSProp
gradient_clipping=5.0
delta=false (System with spliced and subsampled features)

Let me know if that already helps to converge on your data.

@riebling (Contributor) commented:

(1) I was using Adagrad, having not caught up on the rest of this thread where RMSProp is added later ('currently only Adagrad...').
(2) Not really big, you're right. Saving the accumulator in the intermediate model files should be OK, code-wise, unless that makes their format incompatible with being used as a final.mdl.
(3) Sure thing! Starting over; it will be a while before the gradient stats arrive.

@riebling (Contributor) commented Jan 26, 2017

Just curious: what size is your German data? And among the different algorithms, which should perform better (I guess in terms of WER)? The intuition is that RMSProp is better than Adagrad, which is better than gradient descent. And someone here just said "try Adam", which I guess is even newer and better. :)

@riebling (Contributor) commented:

But first, RMSProp looks great. In general, my testing of this pull request is complete; it builds, it runs, no regression errors as far as I can tell.

EPOCH 1 RUNNING ... ENDS [2017-Jan-26 08:44:51]: lrate 0.001, TRAIN ACCURACY 26.4900%, VALID ACCURACY 47.5029%
EPOCH 2 RUNNING ... ENDS [2017-Jan-26 10:42:37]: lrate 0.001, TRAIN ACCURACY 51.0916%, VALID ACCURACY 55.0380%
EPOCH 3 RUNNING ... ENDS [2017-Jan-26 12:39:51]: lrate 0.001, TRAIN ACCURACY 57.4052%, VALID ACCURACY 57.9443%
EPOCH 4 RUNNING ... ENDS [2017-Jan-26 14:39:07]: lrate 0.001, TRAIN ACCURACY 60.9593%, VALID ACCURACY 60.2501%
EPOCH 5 RUNNING ... ENDS [2017-Jan-26 16:44:22]: lrate 0.001, TRAIN ACCURACY 63.4124%, VALID ACCURACY 61.5300%
EPOCH 6 RUNNING ... ENDS [2017-Jan-26 18:42:27]: lrate 0.001, TRAIN ACCURACY 65.3334%, VALID ACCURACY 62.4766%
EPOCH 7 RUNNING ... ENDS [2017-Jan-26 20:40:34]: lrate 0.001, TRAIN ACCURACY 66.8668%, VALID ACCURACY 62.3881%
EPOCH 8 RUNNING ... ENDS [2017-Jan-26 22:41:57]: lrate 0.001, TRAIN ACCURACY 68.1538%, VALID ACCURACY 63.5897%
EPOCH 9 RUNNING ... ENDS [2017-Jan-27 00:39:05]: lrate 0.001, TRAIN ACCURACY 69.2705%, VALID ACCURACY 63.7629%
EPOCH 10 RUNNING ... ENDS [2017-Jan-27 02:36:17]: lrate 0.001, TRAIN ACCURACY 70.2435%, VALID ACCURACY 63.5366%
EPOCH 11 RUNNING ... ENDS [2017-Jan-27 04:34:16]: lrate 0.001, TRAIN ACCURACY 71.1068%, VALID ACCURACY 64.1498%
EPOCH 12 RUNNING ... ENDS [2017-Jan-27 06:31:22]: lrate 0.001, TRAIN ACCURACY 71.8265%, VALID ACCURACY 63.7616%

We'll wait a little longer to see relative WERs across the different algorithms/strategies, but I think this pull request is merge-worthy.

@fmetze (Contributor) commented Jan 27, 2017

Eric, what is the baseline for this, please?

@riebling (Contributor) commented Jan 27, 2017

Still computing using 'best guess' learning rate optimization parameters for 3 setups on the same data: SGD, Adagrad, RMSProp. Soon!

@riebling (Contributor) commented Jan 31, 2017

Best WERs comparison for our Haitian test set
SGD:

                                       |  SPKR    |  # Snt # Wrd   |Corr    Sub     Del     Ins    Err   S.Err  |   NCE     | 
score_p-0.5_10/dev10h.ctm.filt.sys:251:| Sum/Avg  |21530   95893 | 61.0    30.5     8.5     8.5    47.5   32.5  | -0.999  |

RMSProp:

                                       |  SPKR   |  # Snt # Wrd  |Corr    Sub    Del    Ins    Err  S.Err  |   NCE    | 
score_p0.5_7/dev10h.ctm.filt.sys:251:  | Sum/Avg | 21530  95893 | 58.2   28.6   13.2    5.3   47.0   32.5  | -0.908 |

Adagrad:

score_p2.0_9/dev10h.ctm.filt.sys:   |  SPKR     |  # Snt # Wrd  |  Corr    Sub    Del    Ins    Err  S.Err  |   NCE    | 
score_p1.0_9/dev10h.ctm.filt.sys:   |  Sum/Avg  | 21530  95893  |  49.4   35.2   15.3    4.9   55.5   33.6  |  -0.529  |

See attached: training statistics (train/validation accuracy by epoch)
TrainingAlgorithms.xlsx

@fmetze (Contributor) commented Feb 1, 2017 via email

@riebling (Contributor) commented Feb 1, 2017

> how are the training times?

(what a great thing to have the training scripts print this out!)
Also interesting:
SGD: 27 hours
RMSProp: 48 hours (JUST fits in the compute window!)
Adagrad: 69 hours
(I don't have a feel for how these numbers compare with other compute clusters)

@bmilde (Author) commented Feb 7, 2017

Interesting, thanks for the evaluation! My observations, based on the spreadsheet:

  • Adagrad's convergence doesn't look stable on your dataset. The initial learning rate might be too high, or the (atypical) resetting of the accumulators at every epoch gets in the way of convergence.

  • It looks like RMSProp is overfitting heavily after the first few epochs. Have you tried the comparison on a bigger dataset? Did you use dropout?

  • If you train the same number of epochs, SGD will always be faster. If you train with RMSProp, the hope is that you amortize the extra computations by finishing the training in fewer epochs. What was the stopping criterion? Did you lower the initial learning rate for RMSProp, too?

Since convergence is faster in the beginning with RMSProp, you can usually set "halving-after-epoch" and "min-iters / max-iters" to much lower values compared to SGD. I see 25 epochs for SGD and RMSProp, so I guess min-iters was set to 25?

Hopefully I'll find some time soon to implement saving the accumulator state between epochs. That should make "halving-after-epoch" unnecessary.

As for the differences in insertions/deletions, my guess is that these could also fluctuate somewhat with the same optimizer across multiple runs.

@bmilde (Author) commented Feb 22, 2017

I've just added state saving for RMSProp. This is backwards compatible, i.e. you can still load older model files. Also, the accumulators are only saved if you train with RMSProp or Adagrad. Could you run your evaluation again with my newer version?
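
Roughly, the backwards-compatible part works like this sketch (plain text streams for illustration, not the actual Kaldi/Eesen binary I/O; the <Accumulator> token and helper names here are made up):

```cpp
#include <istream>
#include <ostream>
#include <string>
#include <vector>

// Sketch: write the accumulator only when an adaptive optimizer is in use,
// behind an optional marker token, so old models (without the token) still load.
void WriteState(std::ostream &os, const std::vector<float> &accu, bool adaptive) {
  if (!adaptive) return;
  os << "<Accumulator> " << accu.size() << ' ';
  for (float v : accu) os << v << ' ';
}

void ReadState(std::istream &is, std::vector<float> &accu) {
  // Look ahead: if the optional marker is absent, this is an old model file
  // and the accumulator simply starts from zero again.
  std::streampos pos = is.tellg();
  std::string token;
  if (is >> token && token == "<Accumulator>") {
    std::size_t n = 0;
    is >> n;
    accu.assign(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) is >> accu[i];
  } else {
    is.clear();
    is.seekg(pos);   // rewind; nothing to read for old models
  }
}
```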

I can also send you some results on our data next week with the version I've just pushed. So far we see better results with RMSProp than with SGD+momentum on our data, but I haven't compared it with learning rate annealing yet. I can also report that setting the gradient clipping much lower for RMSProp than for SGD isn't necessary anymore; it worked just fine with 40. That this parameter was problematic in the beginning was probably due to the exploding accumulators in the affine layers, which didn't do any clipping at all. Also, I haven't tested Adagrad with the state saving yet.

@riebling (Contributor) commented Mar 8, 2017

Of course we can test this out, sorry for not replying sooner (hectic, bla bla). I have some time on an outside compute cluster (ours is currently down) to give this a try and will let you know.

@riebling (Contributor) commented Apr 4, 2017

Please find attached a chart with training accuracy graphs for baseline, Adagrad, and RMSProp, from the latest code in this branch. I'm not sure why Adagrad took so long to start moving.
AlgorithmComparisonHaitian.xlsx

@bmilde (Author) commented Apr 4, 2017

Thanks, and ditto, sorry for not replying sooner! I'll also send you an off-list reply later this week with results on our German corpus, where we see gains in convergence speed and also in WER with RMSProp. I've mostly used RMSProp for my experiments now, with no decay of the (initial) learning rate. From the data you've sent me, I guess the initial learning rate is too high for Adagrad; if it proves too tricky to use, I'd suggest disabling it in this pull request.

@riebling (Contributor) commented:

Looks pretty good to me
AlgorithmComparisonTedlium.xlsx

@fmetze merged commit d776976 into srvk:master on Jun 24, 2017