
Tiny Language Model Building #19

Closed
prashantserai opened this issue Dec 13, 2016 · 4 comments

Comments

@prashantserai

I'm working on a project where we're trying to recognize spoken sentences in a very specific technical domain and context using the Virtual Machine based EESEN Offline Transcriber. As one of the experiments, we wanted to try a deterministic language model for a fixed set of sentences and words. I noticed on http://speechkitchen.org/kaldi-language-model-building/ that there's a recipe to build a Tiny Language Model, but I couldn't find the requisite files in the Virtual Machine. Any suggestions as to how I could go about building the same?

@riebling
Contributor

The files may have been added since you downloaded the VM; I think a 'git pull' in the srvk/lm_build repository should get you the newest version. Or have a look and copy it directly from: https://github.com/srvk/lm_build/blob/master/make_tinylm_graph.sh

@prashantserai
Author

prashantserai commented Dec 14, 2016

Thanks for your response!

I downloaded that file into ~/eesen/asr_egs/tedlium/v2-30ms/lm_build and copied training_trans_fst.py from ~/tools/eesen-offline-transcriber/local/ to the same place.

My example_txt was a sequence of words like this:
"a afternoon alex all am and any application applications apply are at autumn award be between by closes conferences covers day deadline deleted doing dot each edu eleven email everyone feel fifty finds first for free funding good great have hope i if including into is january know let march materials me message nine note november now occurring of one open osu out outside period please pm questions ray reach recommendation recovered responses saw semester sixteen submitted that the third thirty this three to today travel tuesday twenty unable we well when will window writing you"

which are all the different words that are used in my audio separated by spaces.

I'm confused, though: what exactly should I expect the created deterministic language model to do? Select the most probable word out of these 97 possible words, or something else?

I ask because the words.txt created in the tinylm folder has a list of roughly 150k words. It seems like it took words from ../data/lang_phn_test_test_newlm as well.

What exactly does make_tinylm_graph.sh take and what does it create?

@riebling
Contributor

It's geared more toward example sentences than a bag of words. So you'd want example text containing every permutation of permissible sentences you'd like the system to recognize.
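To make the distinction concrete, here is a hypothetical sketch of generating an example_txt of full sentences rather than a single bag-of-words line. The templates and words below are made up for illustration; they are not from the actual project.

```python
# Hypothetical sketch: build an example_txt of full permissible sentences
# from a few templates. All words and templates here are illustrative only.
templates = [
    ("the deadline closes", ["tuesday", "today"]),
    ("please email", ["alex", "ray"]),
]

sentences = []
for prefix, options in templates:
    for word in options:
        sentences.append(f"{prefix} {word}")

# One permissible sentence per line
with open("example_txt", "w") as f:
    f.write("\n".join(sentences) + "\n")
```

The point is that each line is a complete sentence the system may recognize, so the grammar built from it encodes word order, not just vocabulary.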

Training the 'tiny' LM still uses a general-purpose dictionary to create a lexicon "L"; the example_txt sentences are used to create a grammar "G" (every possible sequence of words, from first word to last word, in a sentence); and there is the already-provided graph of tokens, "T" (this version of EESEN is trained to use phonemes as the tokens, but characters are another way it can work). These are all composed together into a decoding graph, TLG.fst.
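As a toy illustration of the idea (this is not EESEN code; the real graphs are weighted FSTs built with OpenFst, and the pronunciations below are made-up placeholders), composing a lexicon "L" with a grammar "G" yields the phoneme sequences the decoder is allowed to output:

```python
# Toy illustration (not EESEN code) of how L and G constrain decoding.
# Pronunciations are made-up ARPAbet-style placeholders.
lexicon = {                        # "L": word -> phoneme sequence
    "good": ["G", "UH", "D"],
    "afternoon": ["AE", "F", "T", "ER", "N", "UW", "N"],
}
grammar = [["good", "afternoon"]]  # "G": permissible word sequences

# Conceptually, composing L with G gives every phoneme sequence
# the decoder may produce:
allowed = [[p for word in sent for p in lexicon[word]] for sent in grammar]
```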

If you have words you want to add that aren't in the provided general-purpose dictionary, you'll need to add them, as in the earlier instructions.

When decoding audio to produce text, the audio is first converted to a sequence of features. These features go into the already-provided trained neural network that represents the acoustic model, which predicts a sequence of tokens (phonemes). This sequence, together with the decoding graph TLG.fst, produces the text results at the end of the decode process.
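A very rough sketch of that last step, under toy assumptions (the lexicon, grammar, and string-similarity matching are all illustrative; EESEN's actual decoder performs a weighted FST search over TLG.fst, not a string comparison):

```python
# Toy sketch of the final decoding step: pick the grammar sentence whose
# phoneme string best matches the network's predicted token sequence.
# Lexicon entries and pronunciations are made up for illustration.
from difflib import SequenceMatcher

lexicon = {"good": "G UH D",
           "afternoon": "AE F T ER N UW N",
           "great": "G R EY T"}
grammar = [["good", "afternoon"], ["great"]]


def decode(predicted_tokens):
    best, best_score = None, -1.0
    for sentence in grammar:
        phones = " ".join(lexicon[w] for w in sentence).split()
        score = SequenceMatcher(None, phones, predicted_tokens).ratio()
        if score > best_score:
            best, best_score = sentence, score
    return " ".join(best)
```

Even in this toy version, a token sequence can only ever decode to one of the sentences the grammar lists, which is why the example sentences matter so much.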

In your example, the system is only likely to produce output if someone speaks the words in the order they were provided - and even then the output won't make sense, because the words aren't in a sensible order.

@riebling
Contributor

closing; aged out
