Adding Technical Words to Dictionary #18

bakerstreetsystems · 2016-11-29T17:23:59Z

This package is awesome so far! It was WAY simpler to get everything up and going than any of the other methods (i.e. installing CMU Sphinx or Kaldi directly from source). So thank you for a great package.

I'd like to be able to transcribe very technical audio recordings with words like Linux, Laravel, or MySQL, which don't get transcribed very well. How would I go about (easily) adding these words to the transcription software so that they are successfully recognized?

riebling · 2016-11-29T19:11:47Z

There's some info here
that describes how to add new terms to the language model. There's even a
feature that tries to guess phonetic pronunciation and generate dictionary entries
for new words, though I imagine you could improve on the results if you have a
better grasp of the pronunciation(s).

bakerstreetsystems · 2016-11-29T23:47:23Z

Thank you for your help so far! I've attempted to follow the directions suggested here.

I can successfully run the run_adapt.sh script after adding new vocab to newwords.txt, but when I try to use the updated language model to transcribe the audio file with the new vocab, it doesn't recognize the new vocab.

Here is a video of my attempt to follow the directions on how to adapt the language model:

https://www.youtube.com/watch?edit=vd&v=-Zn9_y56R4c

Any suggestions?

riebling · 2016-11-30T15:46:34Z

Suggestions? sure!

I may have left out a step. You not only have to add words to the dictionary (and let the system add phonetic pronunciations), but also add to the example_txt adaptation text, with examples of the words "in use". Otherwise the language model being constructed has no statistical likelihoods of the new word being linked to previous or subsequent word (sequences).

It's been our experience that to get new phrases (with new words) recognized, we need to repeat the examples in the training text quite a bit, sometimes over a hundred times, to raise the statistical likelihood that the new words and phrases will be predicted during decoding. The examples have to be appended or prepended to the file example_txt at some point in the sequence (which end should no longer make a difference, though it used to).

Thanks for noticing and trying this out. We should update the documentation to reflect this :)
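
A minimal sketch of the repetition step described above, for anyone who wants to script it. Only example_txt comes from this thread; the function name, the new_sentences.txt file, and the repeat count of 100 are assumptions made for illustration. The demo runs on a scratch copy so the real example_txt is left alone.

```shell
#!/bin/sh
# Sketch: append every usage sentence N times to the adaptation text.
# append_examples, new_sentences.txt and the count of 100 are hypothetical.

append_examples() {
    # $1 = file with one usage sentence per line
    # $2 = adaptation text to append to (example_txt in this thread)
    # $3 = number of repetitions
    i=0
    while [ "$i" -lt "$3" ]; do
        cat "$1" >> "$2"
        i=$((i + 1))
    done
}

# Demo on scratch files.
workdir=$(mktemp -d)
printf '%s\n' "i like to program using laravel" \
              "sometimes laravel is the best tool to use" > "$workdir/new_sentences.txt"
printf '%s\n' "some existing adaptation text" > "$workdir/example_txt"

append_examples "$workdir/new_sentences.txt" "$workdir/example_txt" 100

# Count the lines that mention the new word as a whole word.
grep -c -w "laravel" "$workdir/example_txt"   # prints 200
```

Running the same grep against the real example_txt before run_adapt.sh is a quick way to confirm the new sentences actually made it into the file.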

bakerstreetsystems · 2016-11-30T16:02:20Z

Thanks for the reply!

Though my video doesn't show me adding any examples of the new vocab being used in sentences, I did try to do this on my own. I added phrases like "i like to program using laravel" and "sometimes laravel is the best tool to use and sometimes it is not" to the example_txt file and then ran the run_adapt.sh script. When that did not work, I tried adding a few more phrases with the word "laravel" in them and then I copied and pasted all the Laravel-related phrases many times to (hopefully) increase their statistical relevance. That didn't work either :-(

Any other suggestions?

riebling · 2016-11-30T16:44:45Z

Then it's getting to the voodoo stage. I remember trying to verify new words could be recognized, and seeing different behavior depending on whether I added to the beginning or end of example_txt. There was a situation I believe is fixed, whereby if you added to the beginning vs. the end, there was a difference, because the scripts were automatically holding out the first 10,000 examples... and so new words didn't even take effect until the new word usage examples exceeded 10,000 lines. But I'm pretty sure it's no longer doing that (we train on ALL the example_txt, and don't leave out the first 10,000)

There's also a problem if you start trying to REDUCE the size of example_txt, since it is assumed to be much larger than 10,000 lines (I count 183710). So to try an extreme 'crazy' example, what if you included something like 400 repetitions of your word 'in use', both at the beginning and end of example_txt? If it still isn't recognized, then I'm wondering if something's weird about the pronunciation that gets obtained from the online tool and added to newdict.dct:

laravel L AE R AH V AH L

Maybe you could try modifying the phonetic pronunciation, since this one was just an algorithmic guess by http://www.speech.cs.cmu.edu/tools/lextool.html

Perhaps instead: laravel L EH R AH V EH L
or: laravel L AA R AH V EH L

In fact you could include both pronunciations.
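
When hand-editing pronunciations like these, a typo in a phone symbol is easy to make. Here is a rough sketch of a sanity check. It assumes the dictionary uses the 39 stress-free ARPAbet phones of the CMU Pronouncing Dictionary, which matches the symbols seen in this thread but is not confirmed for the dictionary the package ships; the function name check_entry is hypothetical.

```shell
#!/bin/sh
# Sketch: verify that each phone in a dictionary entry is a known symbol.
# Assumption: the phone set is the 39 stress-free ARPAbet phones of CMUdict.
PHONES="AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH"

check_entry() {
    # $1 is a full entry such as "laravel L EH R AH V EH L"
    set -- $1        # split the entry into word + phones
    shift            # drop the word, keep the phones
    for phone in "$@"; do
        case " $PHONES " in
            *" $phone "*) ;;   # known phone, keep going
            *) echo "unknown phone: $phone"; return 1 ;;
        esac
    done
    echo "ok"
}

check_entry "laravel L EH R AH V EH L"                            # prints ok
check_entry "laravel L AA R AH V EH L"                            # prints ok
check_entry "laravel L EH R AH V EL" || echo "entry rejected"     # EL is not a phone
```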

bakerstreetsystems · 2016-11-30T18:07:39Z

Still no luck. Here is what I did:

Added about 600 lines with the word 'laravel' in them at the top and bottom of example_txt (so a combined 1,200 lines total).
Reran the run_adapt.sh script
Restarted the virtual machine just in case the queueing system needed to refresh the language model from disk.
Recorded a new audio clip of me saying "i like to program with laravel", which is a direct quote from one of the lines inserted into the example_txt. Then put the audio clip into the "transcribe_me" folder to be transcribed.

Here is the result:

program_with_laravel 1 0.03 1.56 i 1.00
program_with_laravel 1 1.59 0.03 like 1.00
program_with_laravel 1 1.62 0.27 to 1.00
program_with_laravel 1 1.89 0.42 program 1.00
program_with_laravel 1 2.31 0.19 with 0.76
program_with_laravel 1 2.56 0.19 a 0.27
program_with_laravel 1 2.79 0.60 tell 0.32
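
For reading output like this, a small awk sketch can rebuild the transcript and list the words the decoder was unsure about. It assumes the columns are utterance, channel, start time, duration, word, and confidence, which is how the lines above appear to be laid out; the function name ctm_summary and the 0.5 threshold are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: summarize word-level output (assumed columns:
# utterance channel start duration word confidence).

ctm_summary() {
    # $1 = output file, $2 = confidence threshold
    awk -v min="$2" '
        { text = text (NR > 1 ? " " : "") $5 }
        ($6 + 0) < (min + 0) { low = low (low == "" ? "" : " ") $5 "(" $6 ")" }
        END {
            print "transcript: " text
            print "low confidence: " low
        }
    ' "$1"
}

# Demo using the result posted above.
ctm_file=$(mktemp)
cat > "$ctm_file" <<'EOF'
program_with_laravel 1 0.03 1.56 i 1.00
program_with_laravel 1 1.59 0.03 like 1.00
program_with_laravel 1 1.62 0.27 to 1.00
program_with_laravel 1 1.89 0.42 program 1.00
program_with_laravel 1 2.31 0.19 with 0.76
program_with_laravel 1 2.56 0.19 a 0.27
program_with_laravel 1 2.79 0.60 tell 0.32
EOF

ctm_summary "$ctm_file" 0.5
# transcript: i like to program with a tell
# low confidence: a(0.27) tell(0.32)
```

The low-confidence words line up with where "laravel" was spoken, which suggests the decoder had no good candidate for that stretch of audio.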

riebling · 2016-11-30T18:11:28Z

Including modifying the pronunciation dictionary entry before running run_adapt.sh? (I added that as a later edit.) README.md now describes the process.

bakerstreetsystems · 2016-11-30T22:25:22Z

Woohoo!!! It works!!! I edited the newdict.dct file and replaced laravel L AE R AH V AH L with laravel L EH R AH V EH L.

But then the run_adapt.sh script would automatically overwrite my changes (because it was re-querying the CMU Speech tool to get the default pronunciation). So I added a little bit of code (below) to the run_adapt.sh script to allow a pronunciation override. Then I created a file called pronunciation_overrides.txt for, you guessed it, the override pronunciations.

The code below should be added to run_adapt.sh right after the part where it's automatically looking up the pronunciation from CMU Speech tool and right before it says "Constructing the phoneme-based lexicon". As of today, you can paste this code after line 59.

# Added by Jason Jensen to allow for pronunciation override
# If there are any words that you would like to change the default pronunciation for,
# enter them in the pronunciation_overrides.txt file in this same directory
# (if the file doesn't exist, create it). The format should be the same as the default dictionary.
# Example for adding Laravel (a great PHP framework) to the dictionary:
# laravel L EH R AH V EH L

if [ -f pronunciation_overrides.txt ]; then
	echo "Looping through pronunciation overrides found in pronunciation_overrides.txt:"
	# The "|| [ -n ... ]" part also handles a last line with no trailing newline.
	while IFS= read -r line || [ -n "$line" ]; do
		[ -z "$line" ] && continue   # skip blank lines
		word=${line%% *}             # first field is the word itself
		echo "   $line"
		# Replace the whole entry that starts with this word (GNU sed one-line "c" form).
		sed -i "/^$word /c $line" newdict.dct
	done < pronunciation_overrides.txt
fi

And here is a sample of the pronunciation_overrides.txt file:

laravel L EH R AH V EH L

This kind of scripting is not at all my expertise, but it works! Woohoo!

riebling · 2016-12-01T13:47:35Z

And there was much rejoicing! I appreciate your extra scripting (especially knowing it's not your forte) but I had actually updated my reply on GitHub with a slightly-less-inelegant way: directly add pronunciations to the TEDLIUM dictionary file, since it doesn't get rewritten. Very glad to see this finally worked :)
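
A rough sketch of that direct approach: append an entry to a dictionary file only if the exact line is not already there. The path of the TEDLIUM dictionary is not given in this thread, so the demo uses a scratch file; the function name add_pronunciation is hypothetical, and whether the real file has to stay sorted is also not stated here.

```shell
#!/bin/sh
# Sketch: add a pronunciation entry to a dictionary file unless it already exists.
# The dictionary path is an assumption; point it at the real file yourself.

add_pronunciation() {
    # $1 = dictionary file, $2 = full entry, e.g. "laravel L EH R AH V EH L"
    # grep flags: -q quiet, -x whole-line match, -F fixed string (no regex)
    if grep -qxF "$2" "$1"; then
        echo "already present: $2"
    else
        printf '%s\n' "$2" >> "$1"
        echo "added: $2"
    fi
}

# Demo on a scratch dictionary.
dict=$(mktemp)
printf '%s\n' "linux L IH N AH K S" > "$dict"

add_pronunciation "$dict" "laravel L EH R AH V EH L"   # added
add_pronunciation "$dict" "laravel L AA R AH V EH L"   # added (second variant)
add_pronunciation "$dict" "laravel L EH R AH V EH L"   # already present
```

Because the check is on the whole line, two different pronunciations of the same word can coexist while exact duplicates are skipped.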


riebling closed this as completed Jan 13, 2017
