Adding Technical Words to Dictionary #18

Closed
bakerstreetsystems opened this issue Nov 29, 2016 · 9 comments


@bakerstreetsystems

This package is awesome so far! It was WAY simpler to get everything up and going than any of the other methods (e.g. installing CMU Sphinx or Kaldi directly from source). So thank you for a great package.

I'd like to be able to transcribe very technical audio recordings with words like Linux, Laravel, or MySQL, which don't get transcribed very well. How would I go about (easily) adding these words to the transcription software so that they are successfully recognized?

@riebling
Contributor

riebling commented Nov 29, 2016

There's some info here that describes how to add new terms to the language model. There's even a feature that tries to guess phonetic pronunciation and generate dictionary entries for new words, though I imagine you could improve on the results if you have a better grasp of the pronunciation(s).

@bakerstreetsystems
Author

Thank you for your help so far! I've attempted to follow the directions suggested here.

I can successfully run the run_adapt.sh script after adding new vocab to newwords.txt, but when I try to use the updated language model to transcribe the audio file with the new vocab, it doesn't recognize the new vocab.

Here is a video of my attempt to follow the directions on how to adapt the language model:

https://www.youtube.com/watch?edit=vd&v=-Zn9_y56R4c

Any suggestions?

@riebling
Contributor

Suggestions? sure!

I may have left out a step. You not only have to add words to the dictionary (and let the system add phonetic pronunciations), but also add examples of the words "in use" to the example_txt adaptation text. Otherwise the language model being constructed has no statistics for how likely the new word is to follow or precede other words (or word sequences).

It's been our experience that to get new phrases with new words recognized, we need to repeat the examples in the training text quite a bit, sometimes over a hundred times, to raise the statistical likelihood that the new words and phrases will be predicted during decoding. Appending or prepending the examples to example_txt (which of the two should no longer matter, though it used to) has to happen somewhere in the sequence of steps.

Thanks for noticing and trying this out. We should update the documentation to reflect this :)
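
The repetition step described above can be sketched in shell. This is only an illustration: `add_examples` and the `laravel_sentences.txt` file name are hypothetical and not part of the package, and the default of 100 repetitions simply mirrors the "over a hundred times" rule of thumb from this comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the package): add_examples CORPUS SENTENCES [REPEAT]
# Appends every non-blank line of SENTENCES to CORPUS, REPEAT times (default 100).
add_examples() {
    local corpus=$1 sentences=$2 repeat=${3:-100} i
    for ((i = 0; i < repeat; i++)); do
        # Drop blank lines so the corpus does not fill up with empty entries
        grep -v '^[[:space:]]*$' "$sentences" || true
    done >> "$corpus"
}
```

For example, `add_examples example_txt laravel_sentences.txt 400` followed by `./run_adapt.sh` would append 400 copies of each example sentence before rebuilding the language model.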

@bakerstreetsystems
Author

Thanks for the reply!

Though my video doesn't show me adding any examples of the new vocab being used in sentences, I did try to do this on my own. I added phrases like "i like to program using laravel" and "sometimes laravel is the best tool to use and sometimes it is not" to the example_txt file and then ran the run_adapt.sh script. When that did not work, I tried adding a few more phrases with the word "laravel" in them and then I copied and pasted all the Laravel-related phrases many times to (hopefully) increase their statistical relevance. That didn't work either :-(

Any other suggestions?

@riebling
Contributor

riebling commented Nov 30, 2016

Then it's getting to the voodoo stage. I remember trying to verify new words could be recognized, and seeing different behavior depending on whether I added to the beginning or end of example_txt. There was a situation I believe is fixed, whereby if you added to the beginning vs. the end, there was a difference, because the scripts were automatically holding out the first 10,000 examples... and so new words didn't even take effect until the new word usage examples exceeded 10,000 lines. But I'm pretty sure it's no longer doing that (we train on ALL the example_txt, and don't leave out the first 10,000)

There's also a problem if you start trying to REDUCE the size of example_txt since it is assumed to be much larger than 10,000 lines (I count 183710). So to try an extreme 'crazy' example, what if you included something like 400 repetitions of your word 'in use', both at the beginning and end of example_txt? If it still doesn't recognize, then I'm wondering if something's weird about the pronunciation that gets obtained from the online tool, added to newdict.dct:
laravel L AE R AH V AH L - maybe you could try modifying the phonetic pronunciation, since this pronunciation was just an algorithmic guess by http://www.speech.cs.cmu.edu/tools/lextool.html

Perhaps instead: laravel L EH R AH V EH L
or: laravel L AA R AH V EH L

In fact you could include both pronunciations.
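
Before rerunning the adaptation, it may help to confirm that the corpus and the dictionary actually contain what you expect. The following is a rough sketch; `check_word` is a hypothetical helper, and it assumes dictionary entries start with the word followed by a space, as in the `laravel L AE R AH V AH L` line quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the package): check_word WORD CORPUS DICT
# Reports corpus size, how many corpus lines use WORD, and WORD's dictionary entries.
check_word() {
    local word=$1 corpus=$2 dict=$3
    echo "corpus lines: $(grep -c '' "$corpus")"
    echo "lines using '$word': $(grep -c -w -- "$word" "$corpus")"
    echo "dictionary entries:"
    # Assumes each entry begins with the word followed by a single space
    grep -- "^$word " "$dict" || echo "  (none found)"
}
```

Running `check_word laravel example_txt newdict.dct` after editing both files shows whether the usage examples and the intended pronunciation(s) are really in place.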

@bakerstreetsystems
Author

Still no luck. Here is what I did:

  • Added about 600 lines with the word 'laravel' in them at the top and bottom of example_txt (so a combined 1,200 lines total).
  • Reran the run_adapt.sh script
  • Restarted the virtual machine just in case the queueing system needed to refresh the language model from disk.
  • Recorded a new audio clip of me saying "i like to program with laravel", which is a direct quote from one of the lines inserted into the example_txt. Then put the audio clip into the "transcribe_me" folder to be transcribed.

Here is the result:

program_with_laravel 1 0.03 1.56 i 1.00
program_with_laravel 1 1.59 0.03 like 1.00
program_with_laravel 1 1.62 0.27 to 1.00
program_with_laravel 1 1.89 0.42 program 1.00
program_with_laravel 1 2.31 0.19 with 0.76
program_with_laravel 1 2.56 0.19 a 0.27
program_with_laravel 1 2.79 0.60 tell 0.32
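
The output above looks like a CTM-style listing. Assuming the columns are recording name, channel, start time, duration, word, and confidence (an assumption based on the values shown, not on the package documentation), a small sketch can rebuild the transcript and flag the words the recognizer was unsure about. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for CTM-style output; the column layout is assumed to be:
#   name channel start duration word confidence

# Print the recognized words as a single line.
transcript() {
    awk 'NF >= 5 { printf "%s%s", sep, $5; sep = " " } END { print "" }' "$1"
}

# Print start time, word, and confidence for words below THRESHOLD (default 0.5).
low_confidence() {
    local ctm=$1 threshold=${2:-0.5}
    awk -v t="$threshold" 'NF >= 6 && $6 + 0 < t + 0 { printf "%s\t%s\t%s\n", $3, $5, $6 }' "$ctm"
}
```

On the seven lines above, `transcript` prints `i like to program with a tell`, and `low_confidence` with the default threshold lists `a` (0.27) and `tell` (0.32), which is exactly where "laravel" should have been.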

@riebling
Contributor

riebling commented Nov 30, 2016

Did that include modifying the pronunciation dictionary entry before running run_adapt.sh? (I added that as a later edit.) README.md now describes the process.

@bakerstreetsystems
Author

Woohoo!!! It works!!! I edited the newdict.dct file and replaced laravel L AE R AH V AH L with laravel L EH R AH V EH L.

But then the run_adapt.sh script would automatically overwrite my changes (because it was re-querying the CMU Speech tool to get the default pronunciation). So I added a little bit of code (below) to the run_adapt.sh script to allow a pronunciation override. Then I created a file called pronunciation_overrides.txt for, you guessed it, the override pronunciations.

The code below should be added to run_adapt.sh right after the part where it's automatically looking up the pronunciation from CMU Speech tool and right before it says "Constructing the phoneme-based lexicon". As of today, you can paste this code after line 59.

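# Added by Jason Jensen to allow for pronunciation override
# If there are any words that you would like to change the default pronunciation for,
# enter them in the pronunciation_overrides.txt file in this same directory
# (if the file doesn't exist, create it). The format should be the same as the default dictionary
# Example for adding Laravel (a great PHP framework) to the dictionary:
# laravel L EH R AH V EH L

if [ -f pronunciation_overrides.txt ]; then
	echo "Looping through pronunciation overrides found in pronunciation_overrides.txt:"
	# -r keeps backslashes literal; the || clause handles a last line with no trailing newline
	while read -r line || [ -n "$line" ]; do
		set -- $line                 # split into fields; $1 is the word
		[ $# -eq 0 ] && continue     # skip blank lines, which would otherwise match every entry
		echo "   $line"
		# Replace the entry whose first field is $1 (anchored so "laravel" cannot match inside another word).
		# Only existing entries are replaced; words missing from newdict.dct are not added.
		sed -i "/^$1 /c $line" newdict.dct
	done < pronunciation_overrides.txt
fi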

And here is a sample of the pronunciation_overrides.txt file:

laravel L EH R AH V EH L

This kind of scripting is not at all my expertise, but it works! Woohoo!
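
For anyone who would rather not rely on GNU sed's in-place `c` command, here is an alternative sketch of the same override idea using awk. It is not the code that was added to run_adapt.sh: `apply_overrides` is a hypothetical function, it assumes the word is the first whitespace-separated field of each dictionary line, and unlike the loop above it also appends override words that are missing from the dictionary.

```shell
#!/usr/bin/env bash
# Hypothetical alternative (not the code from run_adapt.sh): apply_overrides DICT OVERRIDES
# Replaces every DICT entry whose first field appears in OVERRIDES with the
# override line(s), and appends override words that DICT does not contain yet.
apply_overrides() {
    local dict=$1 overrides=$2 tmp
    [ -s "$overrides" ] || return 0      # nothing to do for a missing or empty file
    tmp=$(mktemp) || return 1
    awk '
        NR == FNR {                       # first file: collect the overrides
            if (NF == 0) next             # ignore blank lines
            if ($1 in over) over[$1] = over[$1] "\n" $0   # several pronunciations per word
            else { over[$1] = $0; order[++n] = $1 }
            next
        }
        ($1 in over) { if (!seen[$1]++) print over[$1]; next }   # swap in the override once
        { print }                         # keep every other entry unchanged
        END { for (i = 1; i <= n; i++) if (!seen[order[i]]) print over[order[i]] }
    ' "$overrides" "$dict" > "$tmp" && cat "$tmp" > "$dict"
    rm -f "$tmp"
}
```

Calling `apply_overrides newdict.dct pronunciation_overrides.txt` at the same point in run_adapt.sh would have the same effect for words already in the dictionary. Listing a word twice in the overrides file keeps both pronunciations, which matches the earlier suggestion to include more than one.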

@riebling
Contributor

riebling commented Dec 1, 2016 via email
