
Improved optimization and implementation of adaptive learning rate algorithms (Adagrad, RMSProp) #106

Merged (18 commits) on Jun 24, 2017

Conversation

@bmilde commented Nov 3, 2016

This adds the option to train with an adaptive learning rate algorithm. Currently only Adagrad is implemented, since it is the simplest of them, but RMSProp is very similar and can easily be added on top of it. In the current version, a switch is provided to train-ctc-parallel:

po.Register("opt-algorithm", &opt, "Optimization algorithm (SGD|Adagrad)");

so that "--opt-algorithm Adagrad" will run the training with Adagrad.

Even though Adagrad has a very simple formula, a lot of changes were necessary:

(+) I added some missing cudamatrix and cudavector functions to make this possible, but similar functions need to be added for the CPU-only version. With the exception of the sqrt function, these are from Kaldi.
(+) Since each layer has its own update function, all layer types needed to be updated with the Adagrad update code. To keep this manageable, helper functions have been added to the base class trainable_layer so that all layer types share the update code; as with SGD, they are called separately for every variable. One accumulator and one temporary variable are added for every gradient matrix and vector in every layer. These are only resized/initialized if Adagrad is chosen as the optimization algorithm, so they shouldn't take extra space on the GPU when running with SGD.
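
To make the shared helper concrete: per parameter, Adagrad accumulates the squared gradients and divides the step by their square root. A minimal sketch of what that update boils down to, on plain vectors (not the actual Eesen/CUDA code; eps and the names here are placeholders):

```cpp
#include <cmath>
#include <vector>

// Sketch of the Adagrad update on a flat parameter vector.
// accu accumulates squared gradients and persists across updates;
// each parameter gets its own effective learning rate lr / (sqrt(accu) + eps).
void AdagradUpdate(std::vector<float> &param,
                   const std::vector<float> &grad,
                   std::vector<float> &accu,   // same size as param
                   float lr, float eps = 1e-8f) {
  for (std::size_t i = 0; i < param.size(); ++i) {
    accu[i] += grad[i] * grad[i];
    param[i] -= lr * grad[i] / (std::sqrt(accu[i]) + eps);
  }
}
```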

I am currently testing with the following parameters (note that Adagrad needs a higher initial learning rate than SGD, is run without momentum, and I found it helpful to clip gradients more aggressively):

learn_rate 0.01
momentum 0.0
maxgrad 4.0

This seems to work for me, but it needs more testing plus a side-by-side WER comparison with SGD. It would also be interesting to see whether the final result is somewhat more robust to the choice of the initial learning rate.

Limitations of the current state of the implementation:
(+) One shortcoming is that the accumulator is not saved between epochs, resulting in a temporary drop in accuracy at the beginning of each epoch until the accumulator is re-estimated. This also means the learning rate annealing schedule is still necessary.
(+) Adagrad is GPU-only at the moment, but since a switch is provided to choose between SGD+momentum and Adagrad, it can still be configured to run on CPU with SGD.
(+) The switch is currently only added to train-ctc-parallel and I've only tested it with the bilstm layers so far.

It would be really cool if someone could try this out! Hopefully all the necessary changes are in this PR; if not, let me know.

@bmilde (Author) commented Nov 4, 2016

I've also noticed that affine-trans-layer didn't do gradient clipping yet, which resulted in very large accumulators and eventually NaNs in that layer. I suspect this can also be a problem with SGD.
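
For reference, the missing clipping amounts to clamping each gradient element to [-maxgrad, maxgrad] before the update; a minimal sketch of the idea (not the actual affine-trans-layer code):

```cpp
#include <algorithm>
#include <vector>

// Sketch: clamp every gradient element to [-max_grad, max_grad] before the
// update, so a few outlier gradients cannot blow up the Adagrad/RMSProp
// accumulators (or the weights themselves under SGD).
void ClipGradient(std::vector<float> &grad, float max_grad) {
  for (float &g : grad)
    g = std::max(-max_grad, std::min(max_grad, g));
}
```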

@bmilde (Author) commented Nov 10, 2016

I've added RMSProp, which addresses Adagrad's weakness of an overly aggressive learning rate decay. A good default learning rate for RMSProp is 0.001.
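
RMSProp replaces Adagrad's ever-growing sum of squared gradients with an exponential moving average, so the effective learning rate stops shrinking towards zero. A minimal sketch of the rule on plain vectors (not the actual Eesen/CUDA code; the decay constant rho and eps are assumptions here, the defaults in the PR may differ):

```cpp
#include <cmath>
#include <vector>

// Sketch of the RMSProp update: mean_sq is an exponential moving average of
// squared gradients instead of an unbounded sum, so the per-parameter step
// size does not decay towards zero over training.
void RMSPropUpdate(std::vector<float> &param,
                   const std::vector<float> &grad,
                   std::vector<float> &mean_sq,   // persistent running average
                   float lr = 0.001f, float rho = 0.9f, float eps = 1e-8f) {
  for (std::size_t i = 0; i < param.size(); ++i) {
    mean_sq[i] = rho * mean_sq[i] + (1.0f - rho) * grad[i] * grad[i];
    param[i] -= lr * grad[i] / (std::sqrt(mean_sq[i]) + eps);
  }
}
```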

@bmilde changed the title from "Implementation of an adaptive learning rate algorithm (Adagrad)" to "Improved optimization and implementation of adaptive learning rate algorithms (Adagrad, RMSProp)" on Nov 10, 2016
@riebling (Contributor) commented:

Currently testing this on Haitian data (sort of our default small-ish test set from the LORELEI low-resource language project) with the suggested

learn_rate 0.01
momentum 0.0
maxgrad 4.0

and though it builds and runs OK, it looks a little unstable*. For comparison, we are trying the same thing with our default settings, probably as close to an apples-to-apples comparison as we can get given the different optimization algorithms:

learn-rate 4e-5
momentum 0.9
maxgrad 4.0

output so far*

EPOCH 7 RUNNING ... ENDS [2017-Jan-23 20:59:34]: lrate 0.01, TRAIN ACCURACY 17.2297%, VALID ACCURACY 40.6902%
EPOCH 8 RUNNING ... ENDS [2017-Jan-23 22:10:28]: lrate 0.01, TRAIN ACCURACY 29.5814%, VALID ACCURACY 45.2823%
EPOCH 9 RUNNING ... ENDS [2017-Jan-23 23:21:40]: lrate 0.01, TRAIN ACCURACY 18.8273%, VALID ACCURACY 39.6867%
EPOCH 10 RUNNING ... ENDS [2017-Jan-24 00:33:00]: lrate 0.01, TRAIN ACCURACY 28.4833%, VALID ACCURACY 45.0673%
EPOCH 11 RUNNING ... ENDS [2017-Jan-24 01:44:48]: lrate 0.01, TRAIN ACCURACY 2.0878%, VALID ACCURACY 24.4542%
EPOCH 12 RUNNING ... ENDS [2017-Jan-24 02:56:36]: lrate 0.01, TRAIN ACCURACY 33.4453%, VALID ACCURACY 46.5732%
EPOCH 13 RUNNING ... ENDS [2017-Jan-24 04:07:22]: lrate 0.01, TRAIN ACCURACY 39.8068%, VALID ACCURACY 46.8349%
EPOCH 14 RUNNING ... ENDS [2017-Jan-24 05:17:46]: lrate 0.01, TRAIN ACCURACY 25.8570%, VALID ACCURACY 43.3929%
EPOCH 15 RUNNING ... ENDS [2017-Jan-24 06:28:16]: lrate 0.01, TRAIN ACCURACY 38.1701%, VALID ACCURACY 46.0704%
EPOCH 16 RUNNING ... ENDS [2017-Jan-24 07:38:41]: lrate 0.01, TRAIN ACCURACY 42.3554%, VALID ACCURACY 48.6366%
EPOCH 17 RUNNING ... ENDS [2017-Jan-24 08:49:04]: lrate 0.01, TRAIN ACCURACY 42.5339%, VALID ACCURACY 49.3602%
EPOCH 18 RUNNING ... ENDS [2017-Jan-24 10:00:03]: lrate 0.01, TRAIN ACCURACY 16.2077%, VALID ACCURACY 34.0114%
EPOCH 19 RUNNING ... ENDS [2017-Jan-24 11:09:53]: lrate 0.01, TRAIN ACCURACY 34.8846%, VALID ACCURACY 43.9012%
EPOCH 20 RUNNING ...

@bmilde (Author) commented Jan 25, 2017

Thanks for trying it out! And your output doesn't look quite as it should...

(1) Which optimization algorithm are you using, RMSProp or Adagrad? (I'd recommend RMSProp.)

(2) How large is the data? From the timings I assume it's fairly small; I've mainly been testing with 120h+ corpora. If it's a small corpus, the accumulators get reset much more often than in my setup. Note that resetting them between epochs is not how you would typically do Adagrad/RMSProp, and I'd say that saving the accumulators between epochs should be implemented to make the implementation complete. That should also help when you don't have a lot of training data. Would saving the accumulator in the intermediate model files be OK, code-wise?

(3) The gradient statistics at the end of the epoch log files could help troubleshoot this; could you post them? Thanks!

@bmilde (Author) commented Jan 26, 2017

I've looked into the parameters I've used for RMSProp on our German corpus (recommended learning rate values are different for RMSProp and Adagrad) and this is what I have:

learn_rate=0.001
momentum=0.0
opt_algorithm=RMSProp
gradient_clipping=5.0
delta=false (System with spliced and subsampled features)

Let me know if that already helps to converge on your data.

@riebling (Contributor) commented:

(1) I was using Adagrad, having not caught up on the rest of this thread where RMSProp is added later ('currently only Adagrad...').
(2) Not really big, you're right. Saving the accumulator in the intermediate model files should be OK, code-wise, unless that makes their format incompatible with being used as a final.mdl.
(3) Sure thing! Starting over; it will be a while before the gradient stats arrive.

@riebling (Contributor) commented Jan 26, 2017

Just curious: what size is your German data? And among the different algorithms, which should perform better (I guess in terms of WER)? The intuition is that RMSProp is better than Adagrad, which is better than gradient descent. And someone here just said "try Adam", which I guess is even newer and better. :)

@riebling (Contributor) commented:

But first, RMSProp looks great. In general, my testing of this pull request is complete; it builds, it runs, no regression errors as far as I can tell.

EPOCH 1 RUNNING ... ENDS [2017-Jan-26 08:44:51]: lrate 0.001, TRAIN ACCURACY 26.4900%, VALID ACCURACY 47.5029%
EPOCH 2 RUNNING ... ENDS [2017-Jan-26 10:42:37]: lrate 0.001, TRAIN ACCURACY 51.0916%, VALID ACCURACY 55.0380%
EPOCH 3 RUNNING ... ENDS [2017-Jan-26 12:39:51]: lrate 0.001, TRAIN ACCURACY 57.4052%, VALID ACCURACY 57.9443%
EPOCH 4 RUNNING ... ENDS [2017-Jan-26 14:39:07]: lrate 0.001, TRAIN ACCURACY 60.9593%, VALID ACCURACY 60.2501%
EPOCH 5 RUNNING ... ENDS [2017-Jan-26 16:44:22]: lrate 0.001, TRAIN ACCURACY 63.4124%, VALID ACCURACY 61.5300%
EPOCH 6 RUNNING ... ENDS [2017-Jan-26 18:42:27]: lrate 0.001, TRAIN ACCURACY 65.3334%, VALID ACCURACY 62.4766%
EPOCH 7 RUNNING ... ENDS [2017-Jan-26 20:40:34]: lrate 0.001, TRAIN ACCURACY 66.8668%, VALID ACCURACY 62.3881%
EPOCH 8 RUNNING ... ENDS [2017-Jan-26 22:41:57]: lrate 0.001, TRAIN ACCURACY 68.1538%, VALID ACCURACY 63.5897%
EPOCH 9 RUNNING ... ENDS [2017-Jan-27 00:39:05]: lrate 0.001, TRAIN ACCURACY 69.2705%, VALID ACCURACY 63.7629%
EPOCH 10 RUNNING ... ENDS [2017-Jan-27 02:36:17]: lrate 0.001, TRAIN ACCURACY 70.2435%, VALID ACCURACY 63.5366%
EPOCH 11 RUNNING ... ENDS [2017-Jan-27 04:34:16]: lrate 0.001, TRAIN ACCURACY 71.1068%, VALID ACCURACY 64.1498%
EPOCH 12 RUNNING ... ENDS [2017-Jan-27 06:31:22]: lrate 0.001, TRAIN ACCURACY 71.8265%, VALID ACCURACY 63.7616%

We'll wait a little longer to see relative WERs across the different algorithms/strategies, but I think this pull request is merge-worthy.

@fmetze (Contributor) commented Jan 27, 2017

Eric, what is the baseline for this, please?

@riebling (Contributor) commented Jan 27, 2017

Still computing using 'best guess' learning rate optimization parameters for 3 setups on the same data: SGD, Adagrad, RMSProp. Soon!

@riebling (Contributor) commented Jan 31, 2017

Best WERs comparison for our Haitian test set
SGD:

                                       |  SPKR    |  # Snt # Wrd   |Corr    Sub     Del     Ins    Err   S.Err  |   NCE     | 
score_p-0.5_10/dev10h.ctm.filt.sys:251:| Sum/Avg  |21530   95893 | 61.0    30.5     8.5     8.5    47.5   32.5  | -0.999  |

RMSProp:

                                       |  SPKR   |  # Snt # Wrd  |Corr    Sub    Del    Ins    Err  S.Err  |   NCE    | 
score_p0.5_7/dev10h.ctm.filt.sys:251:  | Sum/Avg | 21530  95893 | 58.2   28.6   13.2    5.3   47.0   32.5  | -0.908 |

Adagrad:

score_p2.0_9/dev10h.ctm.filt.sys:   |  SPKR     |  # Snt # Wrd  |  Corr    Sub    Del    Ins    Err  S.Err  |   NCE    | 
score_p1.0_9/dev10h.ctm.filt.sys:   |  Sum/Avg  | 21530  95893  |  49.4   35.2   15.3    4.9   55.5   33.6  |  -0.529  |

See attached: training statistics (train/validation accuracy by epoch)
TrainingAlgorithms.xlsx

@fmetze (Contributor) commented Feb 1, 2017 via email

@riebling (Contributor) commented Feb 1, 2017

> how are the training times?

(what a great thing to have the training scripts print this out!)
Also interesting:
SGD: 27 hours
RMSProp: 48 hours (JUST fits in the compute window!)
Adagrad: 69 hours
(I don't have a feel for how these numbers compare with other compute clusters)

@bmilde (Author) commented Feb 7, 2017

Interesting, thanks for the evaluation! My observations, based on the spreadsheet:

  • Adagrad's convergence doesn't look stable on your dataset. The initial learning rate might be too high, or the (atypical) resetting of the accumulators at every epoch gets in the way of convergence.

  • It looks like RMSProp is overfitting heavily after the first few epochs. Have you tried the comparison on a bigger dataset? Did you use dropout?

  • If you train the same number of epochs, SGD will always be faster. If you train with RMSProp, the hope is that you amortize the extra computations by finishing the training in fewer epochs. What was the stopping criterion? Did you lower the initial learning rate for RMSProp, too?

Since convergence is faster in the beginning with RMSProp, you can usually set "halving-after-epoch" and "min-iters / max-iters" to much lower values compared to SGD. I see 25 epochs for SGD and RMSProp, so I guess min-iters was set to 25?

Hopefully I'll find some time soon to implement saving the accumulator state between epochs. That should make "halving-after-epoch" unnecessary.

As for the differences in insertions/deletions, my guess is that these could also fluctuate somewhat with the same optimizer across multiple runs.

@bmilde (Author) commented Feb 22, 2017

I've just added state saving for RMSProp. This is backwards compatible, i.e. you can still load older model files. Also, the accumulators are only saved if you train with RMSProp or Adagrad. Could you run your evaluation again with my newer version?
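
Roughly, the backwards-compatible part works like this sketch (plain text streams for illustration, not the actual Kaldi/Eesen binary I/O; the <Accumulator> token and helper names here are made up):

```cpp
#include <istream>
#include <ostream>
#include <string>
#include <vector>

// Sketch: write the accumulator only when an adaptive optimizer is in use,
// behind an optional marker token, so old models (without the token) still load.
void WriteState(std::ostream &os, const std::vector<float> &accu, bool adaptive) {
  if (!adaptive) return;
  os << "<Accumulator> " << accu.size() << ' ';
  for (float v : accu) os << v << ' ';
}

void ReadState(std::istream &is, std::vector<float> &accu) {
  // Look ahead: if the optional marker is absent, this is an old model file
  // and the accumulator simply starts from zero again.
  std::streampos pos = is.tellg();
  std::string token;
  if (is >> token && token == "<Accumulator>") {
    std::size_t n = 0;
    is >> n;
    accu.assign(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) is >> accu[i];
  } else {
    is.clear();
    is.seekg(pos);   // rewind; nothing to read for old models
  }
}
```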

I can also send you some results on our data next week with the version I've just pushed. So far we see better results with RMSProp than with SGD+momentum on our data, but I haven't compared it with learning rate annealing yet. I can also report that setting the gradient clipping much lower for RMSProp than for SGD isn't necessary anymore; it worked just fine with 40. That this parameter was problematic in the beginning was probably due to the exploding accumulators in the affine layers, which didn't do any clipping at all. Also, I haven't tested Adagrad with the state saving yet.

@riebling (Contributor) commented Mar 8, 2017

Of course we can test this out, sorry for not replying sooner (hectic, bla bla). I have some time on an outside compute cluster (ours is currently down) to give this a try and will let you know.

@riebling (Contributor) commented Apr 4, 2017

Please find attached a chart with training accuracy graphs for baseline, Adagrad, and RMSProp, from the latest code in this branch. I'm not sure why Adagrad took so long to start moving.
AlgorithmComparisonHaitian.xlsx

@bmilde (Author) commented Apr 4, 2017

Thanks, and ditto, sorry for not replying sooner! I'll also send you an off-list reply later this week with results on our German corpus, where we see gains in convergence speed and also in WER with RMSProp. I've mostly used RMSProp for my experiments now, with no decay of the (initial) learning rate. From the data you've sent me, I guess the initial learning rate is too high for Adagrad; if it proves too tricky to use, I'd suggest disabling it in this pull request.

@riebling (Contributor) commented:

Looks pretty good to me
AlgorithmComparisonTedlium.xlsx

@fmetze merged commit d776976 into srvk:master on Jun 24, 2017