Adding Technical Words to Dictionary #18
Comments
There's some info here |
Thank you for your help so far! I've attempted to follow the directions suggested here. I can successfully run the run_adapt.sh script after adding new vocab to newwords.txt, but when I try to use the updated language model to transcribe the audio file with the new vocab, it doesn't recognize the new vocab. Here is a video of my attempt to follow the directions on how to adapt the language model: https://www.youtube.com/watch?edit=vd&v=-Zn9_y56R4c Any suggestions? |
Suggestions? Sure! I may have left out a step. You have to not only add words to the dictionary (and let the system add phonetic pronunciations), but also add examples of the words "in use" to the example_txt adaptation text. Otherwise the language model being constructed has no statistics linking the new word to the words (or word sequences) that come before or after it. In our experience, getting new phrases with new words recognized requires repeating the examples in the training text quite a bit, sometimes over a hundred times, just to raise the likelihood that the new words and phrases will be predicted during decoding. Appending or prepending to example_txt (the choice should no longer make a difference, though it used to) has to happen at some point in the process. Thanks for noticing and trying this out. We should update the documentation to reflect this :) |
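As a rough sketch of the repetition step described above: the shell function below appends every sentence from a small examples file to the adaptation text a given number of times. The function name `append_examples` and the file `my_examples.txt` are made up for illustration; only `example_txt` and `run_adapt.sh` come from this thread, and the count of 100 is just the order of magnitude mentioned above, not a tuned value.

```shell
# Hypothetical helper (not part of the project's scripts): append each line
# of an examples file to the adaptation text, repeated COUNT times.
# usage: append_examples EXAMPLES_FILE COUNT [ADAPTATION_TEXT]
append_examples() {
  examples=$1
  count=$2
  target=${3:-example_txt}

  # If the target does not end in a newline, add one so the first appended
  # sentence does not get glued onto the existing last line.
  if [ -n "$(tail -c 1 "$target" 2>/dev/null)" ]; then
    echo >> "$target"
  fi

  i=0
  while [ "$i" -lt "$count" ]; do
    # Read line by line so a missing final newline in the examples file
    # cannot merge two sentences.
    while IFS= read -r line || [ -n "$line" ]; do
      printf '%s\n' "$line"
    done < "$examples" >> "$target"
    i=$((i + 1))
  done
}

# Example (assumes my_examples.txt holds one sentence per line):
#   append_examples my_examples.txt 100
#   ./run_adapt.sh
```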
Thanks for the reply! Though my video doesn't show me adding any examples of the new vocab being used in sentences, I did try to do this on my own. I added phrases like "i like to program using laravel" and "sometimes laravel is the best tool to use and sometimes it is not" to the example_txt file and then ran the run_adapt.sh script. When that did not work, I tried adding a few more phrases with the word "laravel" in them and then I copied and pasted all the Laravel-related phrases many times to (hopefully) increase their statistical relevance. That didn't work either :-( Any other suggestions? |
Then it's getting to the voodoo stage. I remember trying to verify that new words could be recognized, and seeing different behavior depending on whether I added to the beginning or the end of example_txt. There used to be an issue, which I believe is fixed, where the scripts automatically held out the first 10,000 examples, so new words didn't even take effect until the new word usage examples exceeded 10,000 lines. But I'm pretty sure it's no longer doing that (we train on ALL of example_txt, and don't leave out the first 10,000). There's also a problem if you try to REDUCE the size of example_txt, since it is assumed to be much larger than 10,000 lines (I count 183710). So, to try an extreme 'crazy' example: what if you included something like 400 repetitions of your word 'in use', both at the beginning and at the end of example_txt? If it still isn't recognized, then I wonder whether something's weird about the pronunciation that gets obtained from the online tool and added to newdict.dct: Perhaps instead: In fact you could include both pronunciations. |
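A minimal sketch of the "both ends" experiment suggested above, plus a check for whether any usages survive past a held-out prefix. Both function names and the output file `example_txt.new` are hypothetical; the sketch assumes every input file ends with a newline, and that writing to a new file and swapping it in by hand is acceptable.

```shell
# Hypothetical helpers (not part of the project's scripts).

# Write BLOCK + ORIGINAL + BLOCK to OUT, where BLOCK is the examples file
# repeated COUNT times. OUT must be a different file from ORIGINAL.
# usage: sandwich_examples EXAMPLES_FILE COUNT ORIGINAL OUT
sandwich_examples() {
  block=$(mktemp)
  i=0
  while [ "$i" -lt "$2" ]; do
    cat "$1" >> "$block"
    i=$((i + 1))
  done
  cat "$block" "$3" "$block" > "$4"
  rm -f "$block"
}

# Print how many lines contain WORD as a whole word, ignoring the first
# SKIP lines of FILE.
# usage: count_after WORD SKIP FILE
count_after() {
  tail -n +"$(($2 + 1))" "$3" | grep -cw -- "$1"
}

# Example:
#   sandwich_examples my_examples.txt 400 example_txt example_txt.new
#   wc -l example_txt.new                      # total number of lines
#   count_after laravel 10000 example_txt.new  # usages past the first 10,000 lines
```

If `count_after` prints 0 for a skip of 10000, a script that still held out the first 10,000 lines would never see the new word during training.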
Still no luck. Here is what I did:
Here is the result: `program_with_laravel 1 0.03 1.56 i 1.00` |
Including modifying the pronunciation dictionary entry before run_adapt.sh? (I added that as a later edit) README.md now describes the process |
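Woohoo!!! It works!!! I edited the newwords.dct file and replaced `laravel L AE R AH V AH L` with `laravel L EH R AH V EH L`. But then the `run_adapt.sh` script would automatically overwrite my changes (because it was re-querying the CMU Speech tool to get the default pronunciation). So I added a little bit of code to the `run_adapt.sh` script to allow a pronunciation override, and created a file called `pronunciation_overrides.txt` for the override pronunciations. The code should be added to `run_adapt.sh` right after the part where it automatically looks up the pronunciation from the CMU Speech tool and right before it says "Constructing the phoneme-based lexicon".
And here is a sample of the `pronunciation_overrides.txt` file: `laravel L EH R AH V EH L`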
This kind of scripting is not all my expertise, but it works! Woohoo! |
And there was much rejoicing! I appreciate your extra scripting
(especially knowing it's not your forte) but actually had updated my reply
on GitHub to do a slightly-less-inelegant way: directly add pronunciations
to the TEDLIUM dictionary file, since it doesn't get rewritten.
Very glad to see this finally worked :)
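For readers who prefer the approach mentioned in this reply (adding pronunciations straight to a dictionary file that the scripts do not rewrite), here is a small sketch. The function name is hypothetical and the dictionary path is deliberately left as an argument, because the thread does not give the TEDLIUM dictionary's location. It assumes one entry per line with the word first and phones separated by single spaces, like the entries quoted in this thread; whether the file must stay sorted is not stated here, so check before relying on it.

```shell
# Hypothetical helper: append "WORD PHONE PHONE ..." to a dictionary file
# unless that exact line is already present.
# usage: add_pronunciation DICT_FILE WORD PHONE...
add_pronunciation() {
  dict=$1
  shift
  entry="$*"   # word and phones joined by single spaces

  # Skip the write if an identical entry already exists.
  if grep -Fxq -- "$entry" "$dict" 2>/dev/null; then
    return 0
  fi

  # Make sure the existing file ends with a newline before appending.
  if [ -n "$(tail -c 1 "$dict" 2>/dev/null)" ]; then
    echo >> "$dict"
  fi
  printf '%s\n' "$entry" >> "$dict"
}

# Example, listing both pronunciations discussed in this thread
# (DICT is a placeholder for wherever the dictionary actually lives):
#   add_pronunciation "$DICT" laravel L EH R AH V EH L
#   add_pronunciation "$DICT" laravel L AE R AH V AH L
```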
…On Wed, November 30, 2016 5:25 pm, JJ wrote:
Woohoo!!! It works!!! I edited the newwords.dct file and replaced
`laravel L AE R AH V AH L` with `laravel L EH R AH V EH L`.
But then the `run_adapt.sh` script would automatically overwrite my
changes (because it was re-querying the CMU Speech tool to get the
default pronunciation). So I added a little bit of code (below) to the
`run_adapt.sh` script to allow a pronunciation override. Then I created a
file called `pronunciation_overrides.txt` for, you guessed it, the
override pronunciations.
The code below should be added to `run_adapt.sh` right after the part
where it's automatically looking up the pronunciation from CMU Speech
tool and right before it says "Constructing the phoneme-based lexicon".
As of today, you can paste this code after line 59.
```
# Added by Jason Jensen to allow for pronunciation override
# If there are any words that you would like to change the default pronunciation for,
# enter them in the pronunciation_overrides.txt file in this same directory
# (if the file doesn't exist, create it). The format should be the same as the default dictionary
# Example for adding Laravel (a great PHP framework) to the dictionary:
# laravel L EH R AH V EH L
if [ -f pronunciation_overrides.txt ]; then
  echo "Looping through pronunciation overrides found in pronunciation_overrides.txt:"
  # The "|| [ -n "$line" ]" part keeps a last line that has no trailing newline
  while read -r line || [ -n "$line" ]; do
    # Split the line into words so that $1 is the word being overridden
    set -- $line
    echo " $line"
    # Replace any newdict.dct line containing "<word> " with the override line
    sed -i "/$1 /c $line" newdict.dct
  done < pronunciation_overrides.txt
fi
```
And here is a sample of the `pronunciation_overrides.txt` file:
`laravel L EH R AH V EH L`
This kind of scripting is not all my expertise, but it works! Woohoo!
|
This package is awesome so far! It was WAY simpler to get everything up and going than any of the other methods (e.g. installing CMU Sphinx or Kaldi directly from source). So thank you for a great package.
I'd like to be able to transcribe very technical audio recordings with words like Linux, Laravel, or MySQL, which don't get transcribed very well. How would I go about (easily) adding these words to the transcription software so that they are successfully recognized?