Recreating the lang_phn_test_test_newlm LM #2
You're quite right, this gets built from a different, smaller example training text. In some cases this method is actually preferable; for example, it requires less memory for decoding. The reason is that training an LM on a very large example text can take a lot of RAM: over 100 GB. If you have access to that kind of memory, then maybe we could go with the original LM, which I believe was provided by Cantab Research and trained on an enormous-RAM machine on AWS (or a supercomputing cluster). Another improvement would be training a 4-gram rather than a tri-gram model, without pruning, but the resulting graph is so huge that decoding with it would then ALSO require a huge amount of memory. A trick is to use just the grammar (G) portion of such a very large model for rescoring; this is outlined in this commented-out section of the Vagrantfile (which includes downloading the 3.5 GB …)
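To make the tri-gram vs. 4-gram trade-off concrete, here is a rough sketch of what the two builds might look like with SRILM's `ngram-count` (the actual lm_build scripts may use different tools and options; the file names here are assumptions):

```bash
# Hypothetical sketch: the exact commands used by lm_build may differ.
TEXT=example_txt          # LM training text (name assumed from lm_build)
VOCAB=wordlist.txt        # word list (name assumed from lm_build)

# Pruned tri-gram: smaller ARPA file, smaller TLG.fst, cheaper decoding
ngram-count -order 3 -text "$TEXT" -vocab "$VOCAB" \
    -kndiscount -interpolate -prune 1e-8 -lm lm_tg_pruned.arpa.gz

# Unpruned 4-gram: better modeling, but the composed decoding graph
# (and the RAM needed to build and decode with it) grows dramatically
ngram-count -order 4 -text "$TEXT" -vocab "$VOCAB" \
    -kndiscount -interpolate -lm lm_4g.arpa.gz
```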
Here is a breakdown of three possible decoding graphs based on different LM build techniques:
Now that you bring this to our attention, it is worth verifying that the provided downloadable decoding graph built from the provided LM (1.) is comparable to, if not the same as, one generated by following the lm_build instructions (3.). If you need to download it again, by the way, the URL is http://speechkitchen.org/vms/Data/v2-30ms.tgz — the LM building process should not have overwritten it. It may be the result of an older version, in which case we suggest you start in a new VM or update from https://github.com/srvk/lm_build Good luck, and thanks for your feedback!
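A quick way to check would be something like the following (the path of the locally built graph is an assumption; the prebuilt path matches the archive layout mentioned in this thread):

```bash
# Fetch the prebuilt graph and compare it against a locally built one.
wget http://speechkitchen.org/vms/Data/v2-30ms.tgz
tar xzf v2-30ms.tgz

# Structural comparison of the two TLG graphs with OpenFst
fstinfo v2-30ms/data/lang_phn_test_test_newlm/TLG.fst | head
fstinfo my_lm_build_output/lang_phn_test_test_newlm/TLG.fst | head   # path assumed

# Identical builds should at least show matching state/arc counts;
# byte-identical files would also match on checksum
md5sum v2-30ms/data/lang_phn_test_test_newlm/TLG.fst \
       my_lm_build_output/lang_phn_test_test_newlm/TLG.fst
```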
Another observation: in (3.) above, the language model is actually a bit bigger, because 10% of the example_txt data was not going into the LM even though it could have, so we updated the scripts to include it. This should make it better than (1.), not worse. As for your example, it's entirely possible that you found audio whose text consists of word sequences that just aren't covered by example_txt. What would be interesting (but in a way is 'cheating') is to include your text with example_txt. However, including it only once may produce no change in results, and repeating it numerous times (to increase statistical likelihoods in the decoding graph) is definitely cheating. :-/
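If you do want to try that experiment, it amounts to something like this before re-running the build (my_transcript.txt is a made-up name; example_txt and wordlist.txt are the lm_build input files):

```bash
# Fold your own transcript into the LM training text (adding it once;
# repeating it many times would inflate its weight, i.e. "cheating").
cat my_transcript.txt >> example_txt

# Any new words must also appear in the word list, or they remain OOV
tr ' ' '\n' < my_transcript.txt | sort -u >> wordlist.txt
sort -u -o wordlist.txt wordlist.txt
```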
So I was earlier using the instructions under the heading "Adapting your own Language Model for EESEN-tedlium" here.
Was that a bad idea? Anyway, I have now tried run_adapt.sh with the original files, and the resulting language model gave me transcriptions very close to what I had gotten with the pre-existing language model at v2-30ms/data/lang_phn_test_test_newlm/TLG.fst. So essentially I could more or less recreate that LM, and I can now hope to improve over that baseline with adaptation, so I'm happy about that.
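For anyone following along, the baseline-reproduction run amounted to roughly this (the invocation details are an assumption; check the lm_build README for the exact arguments run_adapt.sh expects):

```bash
# Rebuild the LM from the stock inputs and check that decoding output
# matches the pre-existing graph.
cd lm_build            # https://github.com/srvk/lm_build (path assumed)
./run_adapt.sh         # example_txt / wordlist.txt left untouched here

# Then decode the same recording with both graphs and compare transcripts:
#   v2-30ms/data/lang_phn_test_test_newlm/TLG.fst   (pre-existing)
#   the graph produced by run_adapt.sh              (rebuilt)
```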
That wasn't a bad idea, just an old idea. We've had some folks who tried very hard to add custom (non-dictionary) words and had difficulty getting them recognized; what finally worked was manually adding the right phonetic pronunciation to the dictionary, because pronunciation matters. :) You're exactly right that you should have been able to produce nearly identical results by following the steps in …
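For example, a custom word typically only starts being recognized after something like this is added to the pronunciation lexicon (the word, its phone sequence, and the lexicon path below are purely illustrative):

```bash
# Add a hand-written ARPAbet pronunciation for an out-of-dictionary word,
# then rebuild the lexicon/decoding graph so it can actually be recognized.
# "SPEECHKITCHEN" is a made-up example; the lexicon path depends on your
# setup (often something like dict/lexicon.txt).
echo "SPEECHKITCHEN S P IY CH K IH CH AH N" >> dict/lexicon.txt
sort -u -o dict/lexicon.txt dict/lexicon.txt
```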
I think the key issue I had with the old idea was that it was not reproducing the baseline. The results were not nearly identical; they were much worse. Just wanted to point that out once again. Having said that, as long as I'm able to get things working using run_adapt.sh, I don't care much whether the older recipe works or not. And yeah, thanks for reminding me about the pronunciations point. PS: I created a separate thread about another issue I faced with run_adapt.sh, and even with the older recipe, here.
I'm very grateful for this package, for these kinds of tools and documentation, and especially for the help I'm receiving on this forum, through my own questions and through others'.
I followed the instructions at http://speechkitchen.org/kaldi-language-model-building/ and tried building a language model using the steps under "Adapting your own Language Model for EESEN-tedlium". Following those instructions without touching example_txt or wordlist.txt in that folder did not seem to result in the same language model as was originally present in "lang_phn_test_test_newlm", though.
The one I got in "lang_phn_test_test_newlm" seems significantly inferior to the one that was available there before I overwrote those files. For one specific recording, I was getting WERs around 24% with the original "lang_phn_test" and 25% with the original "lang_phn_test_test_newlm". Now, after running the adaptation scripts (but modifying neither example_txt nor wordlist.txt), I got WERs around 35%!
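(For reference, these numbers come from scoring the decoded transcript against a reference transcript; something like Kaldi's compute-wer produces this kind of figure. The file names below are placeholders, each file holding one "utterance-id words..." line per utterance.)

```bash
# Score hypothesis transcripts against references; prints %WER and error counts.
compute-wer --text --mode=present ark:ref_text.txt ark:hyp_text.txt
```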
After I actually adapted example_txt and wordlist.txt, I could improve on the 35%, but it's still quite a bit worse than 25% (the remaining gap being inversely proportional to the extent of cheating done).
If I could figure out how to get to 25% without adaptation and use adaptation to improve on top of that, it might be beneficial.
Thanks!