
local attention with unidirectional lstm not converging #41

Closed
manish-kumar-garg opened this issue Jan 13, 2020 · 5 comments

@manish-kumar-garg

Hi @albertz,
I tried different learning rates, but the model does not seem to converge after changing the bidirectional LSTM to a unidirectional LSTM in the local attention setup.
Can you suggest something? What else should I try?

@albertz (Member) commented Jan 14, 2020

It seems a very related question was also asked in #42.
I have not tried unidirectional LSTMs in the encoder yet, so I don't know. You probably should play around with all the available hyperparameters, e.g.:

  • Learning rate (initial rate during warmup, and highest learning rate after warmup)
  • Learning rate warmup length (number of epochs)
  • Pretraining. E.g. the starting number of layers (try 2). The initial time reduction (try increasing it, e.g. 6, 8, 16, or even 32). Try making it longer (more repetitions). Etc.
  • Less or no SpecAugment in the beginning.
  • Higher batch size in the beginning, or gradient accumulation in the beginning.
  • Curriculum learning, i.e. the epoch_wise_filter option.
  • ...

Let this smallest network, with the highest time reduction, a high batch size, less/no SpecAugment, etc., train like that for as long as needed before increasing anything. This small network should first reach some halfway decent score. Only once you see that should the pretraining increase the depth and other things, and only carefully and slowly (so that the network does not completely break again). A rough config sketch of these options follows below.
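
For concreteness, here is a minimal sketch of where these knobs typically live in a RETURNN config of the asr_2018_attention style. The concrete values are purely illustrative, the construction algo is only a placeholder, and the epoch_wise_filter keys are an assumption that depends on the dataset used:

    # Sketch only: illustrative values, not a recommendation.
    import numpy

    def custom_construction_algo(idx, net_dict):
        # Placeholder only: the real construction algo starts with fewer encoder layers
        # and a higher time reduction, then grows the network with each pretrain step `idx`.
        # Returning None tells RETURNN that pretraining is finished.
        return net_dict if idx == 0 else None

    learning_rate = 0.0008  # highest learning rate, reached after warmup
    learning_rates = list(numpy.linspace(0.0003, learning_rate, num=15))  # warmup over 15 epochs

    # Pretraining: repeat each construction step several times so the small network
    # gets enough updates before the next growth step.
    pretrain = {"repetitions": 5, "construction_algo": custom_construction_algo}

    batch_size = 10000             # a higher batch size in the beginning can help stability
    accum_grad_multiple_step = 2   # or accumulate gradients instead of enlarging the batch

    # Curriculum learning: restrict the first epochs to shorter/easier sequences.
    train = {
        # ... other dataset options ...
        "epoch_wise_filter": {(1, 5): {"max_mean_len": 200}},  # assumed key names
    }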

@manish-kumar-garg (Author) commented Jan 16, 2020

Thanks @albertz for suggesting these.
I trained the following models up to the end of the pretraining epochs (45) and observed the following losses:

  1. Base model - asr_2018_attention - with the following hyperparameters:
     pretrain = {"repetitions": 5, "construction_algo": custom_construction_algo}
     learning_rate = 0.0008
     learning_rates = list(numpy.linspace(0.0003, learning_rate, num=15))  # warmup
     [image: loss curve]

  2. Uni LSTM, size 1024, with all hyperparameters the same as 1:
     [image: loss curve]

  3. Uni LSTM, size 1024, with all hyperparameters the same as 1 except:
     pretrain = {"repetitions": 7, "construction_algo": custom_construction_algo}
     [image: loss curve]

  4. Uni LSTM, size 1024, with all hyperparameters the same as 1 except:
     learning_rate = 0.0005
     [image: loss curve]

  5. Uni LSTM, size 1024, with all hyperparameters the same as 1 except 10 warmup steps:
     learning_rates = list(numpy.linspace(0.0003, learning_rate, num=10))  # warmup
     [image: loss curve]

  6. Uni LSTM, size 1024, with all hyperparameters the same as 1 except 20 warmup steps:
     learning_rates = list(numpy.linspace(0.0003, learning_rate, num=20))  # warmup
     In this case the loss becomes NaN after 10 epochs.

  7. Uni LSTM, size 1536, with all hyperparameters the same as 1:
     [image: loss curve]

All of these models use global attention. (A small snippet comparing the three warmup schedules follows after this list.)
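
For reference, a small standalone snippet (plain Python/NumPy, independent of the configs above) that compares the three warmup schedules tried here:

    import numpy

    learning_rate = 0.0008
    for num in (10, 15, 20):
        warmup = list(numpy.linspace(0.0003, learning_rate, num=num))
        # Each entry is the learning rate for one warmup epoch, ramping linearly
        # from 0.0003 up to the peak learning_rate.
        print(f"{num} warmup epochs: first={warmup[0]:.4f}, last={warmup[-1]:.4f}")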

@manish-kumar-garg (Author)

It seems like decreasing the learning rate helps.
Also, increasing the LSTM cell size to 1536 helps a bit, though not much.

What other combinations do you suggest trying next?

@albertz (Member) commented Jan 17, 2020

All the things I already wrote (here), but basically everything else as well.

@manish-kumar-garg (Author)

Lowering the learning rate (lr=0.0005, lr_init=0.0002) worked for me.
Thanks!
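
For completeness, a sketch of how those final values would look with the linspace-style warmup used above (the warmup length is not stated here; 15 epochs is only an assumption carried over from the base config):

    import numpy

    learning_rate = 0.0005  # lowered peak learning rate that converged here
    # Warmup starting from 0.0002; num=15 is an assumption (same as the base config).
    learning_rates = list(numpy.linspace(0.0002, learning_rate, num=15))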
