Asking help for training LM using returnn. #21

yanghongjiazheng · 2019-07-19T10:06:11Z

Hi,
I added more training data for better performance. Because the corpus has changed, I want to train a new LM on my data. But the exact format of the data is not documented on Git, for example "train": "/work/asr3/irie/data/librispeech/lm_bpe/librispeech-lm-norm.bpe.txt.gz". Can you give me an example?
Second, when I run ./returnn/rnn.py **.config, a FileNotFoundError occurred: err_msg = "No such file or directory: 'cf'", len = 31. So I'd like to know what the function def cf(filename) in the config file does.
And what is the cf command for?
Last but not least, I want to know why train_num_seqs = 40418260, when there are only 281241 sentences in the train dataset.
Thanks for your great work.

albertz · 2019-07-19T10:32:17Z

You can download the LM model here.

The LM train data can be downloaded from the official page.
It might need some post processing to prepare that file librispeech-lm-norm.bpe.txt.gz but it should be straightforward. (Maybe ask @kazuki-irie for details on that.)

The cf function uses an internal tool of ours, but you can ignore it. Just remove it.
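
If removing every cf(...) call from the config is tedious, one alternative is to redefine cf so that it falls back to the original path when the tool is missing. This is a minimal sketch, not the original helper: it assumes cf is an external command that prints the path of a local copy of the file, and the `tool` parameter is a hypothetical addition for illustration.

```python
import shutil
import subprocess


def cf(filename, tool="cf"):
    """Return a usable path for `filename`.

    Assumption: `tool` is an external caching command that prints the path
    of a local copy. If it is not installed, the original path is returned.
    """
    if shutil.which(tool) is None:
        # No such executable on PATH: use the file as-is.
        return filename
    # Tool is available: ask it for the cached location.
    return subprocess.check_output([tool, filename]).decode("utf8").strip()
```

With this definition, a config line such as cf("/path/to/train.txt.gz") simply evaluates to the same path on machines without the tool.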

kazuki-irie · 2019-07-19T11:51:34Z

/work/asr3/irie/data/librispeech/lm_bpe/librispeech-lm-norm.bpe.txt.gz is a plain, line-based text file (without sentence boundaries).
For this particular example, it is pre-processed to be on the BPE level.
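
To make the format concrete, here is a minimal sketch that writes and reads such a file: gzip-compressed UTF-8 text, one sentence per line, tokens separated by spaces. The file name and the example sentences are made up; the "@@" suffix is the default continuation marker of subword-nmt and is assumed here.

```python
import gzip
import os
import tempfile


def write_lm_corpus(path, sentences):
    """Write one sentence per line to a gzip-compressed UTF-8 text file."""
    with gzip.open(path, "wt", encoding="utf8") as f:
        for sentence in sentences:
            f.write(sentence.strip() + "\n")


def read_lm_corpus(path):
    """Yield the sentences back, one per line, without the trailing newline."""
    with gzip.open(path, "rt", encoding="utf8") as f:
        for line in f:
            yield line.rstrip("\n")


# Illustrative BPE-level lines; "@@" marks a subword that continues.
example = ["the qu@@ ick brown fox", "hel@@ lo world"]
path = os.path.join(tempfile.mkdtemp(), "my-lm-train.bpe.txt.gz")
write_lm_corpus(path, example)
```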

> Last but not least, I want to know why train_num_seqs = 40418260, when there are only 281241 sentences in the train dataset.

If you download the training data file from the link in @albertz's comment above, the number should match. Also, in our newest setups, we do not use that option anymore.
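
If you still set train_num_seqs for your own corpus, you can get the value by counting the lines of your training file. A minimal sketch, assuming one sequence per line in a gzip-compressed text file; the toy file below only stands in for the real corpus.

```python
import gzip
import os
import tempfile


def count_seqs(path):
    """Count the lines (one sequence each) in a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf8") as f:
        return sum(1 for _ in f)


# Tiny stand-in file so the sketch runs anywhere; point `path` at your corpus instead.
path = os.path.join(tempfile.mkdtemp(), "toy.bpe.txt.gz")
with gzip.open(path, "wt", encoding="utf8") as f:
    f.write("a b c\nd e\nf\n")

num_seqs = count_seqs(path)
print("train_num_seqs =", num_seqs)
```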
We are uploading better LM configs/pre-trained models here:
https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers/librispeech

kazuki-irie · 2019-07-19T16:16:17Z

Closing as all original questions have been answered.
Please feel free to re-open if you have further questions.

yanghongjiazheng · 2019-07-22T04:24:33Z

Thanks for your reply. The LM training works now. But I am still confused about a few things. First, I think trans.bpe.vocab.lm.txt is obtained from the LibriSpeech training data, and the LM training dataset can be much larger than that. So is trans.bpe.vocab.lm.txt still sufficient for the training? Or should I first apply ./apply_bpe.py -c trans.bpe.codes to get the BPE units, and then replace the subwords that do not appear in trans.bpe.vocab.lm.txt with unk? Otherwise, I can't see the purpose of the unk label.

kazuki-irie · 2019-07-22T08:47:57Z

> First, I think trans.bpe.vocab.lm.txt is obtained from the LibriSpeech training data.

That is correct.

> And the LM training dataset can be much larger than that. So is trans.bpe.vocab.lm.txt still sufficient for the training?

Yes, that is not a problem if you apply subword-nmt/apply_bpe.py to your LM training dataset with the correct flags (exactly as you should be doing for the dev and test sets): both the -c flag, to provide your BPE codes, and the --vocabulary flag, to specify your vocabulary, must be given.
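
As a rough illustration, the call could be assembled like this. It is only a sketch: the vocabulary file name and the input/output names in the comment are hypothetical, and the vocabulary file is assumed to be in the format that apply_bpe.py expects.

```python
import shlex


def apply_bpe_command(codes, vocabulary, script="subword-nmt/apply_bpe.py"):
    """Build the argument list for apply_bpe.py with both required flags."""
    return ["python3", script, "-c", codes, "--vocabulary", vocabulary]


# "bpe.vocab.txt" is a placeholder name, not a file from this thread.
cmd = apply_bpe_command("trans.bpe.codes", "bpe.vocab.txt")
print(" ".join(shlex.quote(part) for part in cmd))

# apply_bpe.py reads stdin and writes stdout by default, so to run it:
#   with open("lm-train.txt") as fin, open("lm-train.bpe.txt", "w") as fout:
#       subprocess.run(cmd, stdin=fin, stdout=fout, check=True)
```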

If you do so, you will only get OOVs for unknown characters (therefore for LibriSpeech, you might not need an extra unknown token: If you prefer you could remove it, but I do not expect much effect on the recognition results).
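
If you do keep the unknown token, mapping the remaining OOV tokens onto it can be sketched as below. This assumes the vocabulary is already loaded as a Python set of subword strings and that the label is spelled "<unk>"; both are assumptions for illustration, so adapt them to your vocabulary file.

```python
def map_oov(line, vocab, unk="<unk>"):
    """Replace every whitespace-separated token that is not in `vocab` by `unk`."""
    return " ".join(tok if tok in vocab else unk for tok in line.split())


# Toy vocabulary; the last token of the example line is not covered by it.
vocab = {"hel@@", "lo", "wor@@", "ld"}
mapped = map_oov("hel@@ lo wor@@ ld ¤", vocab)
print(mapped)
```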

albertz mentioned this issue Jul 19, 2019

Shallow fusion of LSTM LM #4

Closed

kazuki-irie closed this as completed Jul 19, 2019
