Asking for help with training an LM using returnn #21
You can download the LM model here. The LM training data can be downloaded from the official page.
If you download the training data file from the link in @albertz's comment above, the numbers should match. Also, in our newest setups, we do not use that option anymore.
Closing as all original questions have been answered.
Thanks for your reply. The LM training works now, but I still have some questions. First, I think trans.bpe.vocab.lm.txt was obtained from the LibriSpeech training data, and the LM training dataset can be much larger than that. So is trans.bpe.vocab.lm.txt still suitable for that training? Or should I first apply ./apply_bpe.py -c trans.bpe.codes to get the BPE segmentation, and then replace the subwords that don't show up in trans.bpe.vocab.lm.txt with unk? Otherwise, I can't see the purpose of the unk label.
That is correct.
Yes, that is not a problem if you apply BPE. If you do so, you will only get OOVs for unknown characters (therefore, for LibriSpeech, you might not need an extra unknown token: if you prefer, you could remove it, but I do not expect much effect on the recognition results).
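To illustrate the mapping discussed above, here is a minimal sketch of replacing out-of-vocabulary BPE subwords with an unk label after running apply_bpe.py. The file layout (one vocabulary entry per line) and the "<unk>" spelling are assumptions, not something specified in this thread:

```python
# Sketch: after apply_bpe.py, map any subword absent from the LM
# vocabulary (e.g. trans.bpe.vocab.lm.txt) to "<unk>".
# The vocab file format and unk label here are assumptions.

def load_vocab(path):
    """Read one vocabulary entry per line (first whitespace-separated field)."""
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

def map_oov(line, vocab, unk="<unk>"):
    """Replace subwords not present in the vocabulary with the unk label."""
    return " ".join(tok if tok in vocab else unk for tok in line.split())

# Toy inline vocabulary for demonstration ("@@" marks BPE continuation):
vocab = {"the", "quick@@", "est", "fox"}
print(map_oov("the quick@@ est zebra", vocab))  # -> the quick@@ est <unk>
```

As noted above, with BPE applied this mapping should only ever trigger for characters the BPE codes have never seen.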
Hi,
I added more training data for better performance. Because the corpus has changed, I want to train a new LM on my own data. But the exact format of the data is not given in the repository, e.g. "train": "/work/asr3/irie/data/librispeech/lm_bpe/librispeech-lm-norm.bpe.txt.gz". Can you give me an example?
Second, when I run ./returnn/rnn.py **.config, a FileNotFoundError occurred: err_msg = "No such file or directory: 'cf'", len = 31. So I'd like to know the purpose of def cf(filename) in the config file. What is the cf command for?
Last but not least, I want to know why train_num_seqs = 40418260 when there are only 281241 sentences in the training dataset.
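The likely resolution, per the answer above, is that train_num_seqs counts lines in the full LM corpus (librispeech-lm-norm, presumably ~40M sentences), not the 281k audio transcriptions. Assuming the corpus is a gzipped plain-text file with one BPE-tokenized sentence per line (an assumption based on the .bpe.txt.gz filename), the count can be verified directly; a toy stand-in file is used here:

```python
# Sketch: count sequences (non-empty lines) in a gzipped LM corpus,
# assuming one sentence per line as suggested by the .bpe.txt.gz name.
import gzip

def count_seqs(path):
    """Count non-empty lines (sequences) in a gzipped text corpus."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Build a tiny stand-in corpus to demonstrate the assumed format:
with gzip.open("toy.bpe.txt.gz", "wt", encoding="utf-8") as f:
    f.write("the quick@@ est fox\n")
    f.write("an@@ other sentence\n")

print(count_seqs("toy.bpe.txt.gz"))  # -> 2
```

Running this on the downloaded librispeech-lm-norm.bpe.txt.gz should reproduce the train_num_seqs value if the file matches.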
Thanks for your great work.