# Welcome to Exkaldi

In this section, we will train an n-gram language model and query it.

Although __SRILM__ is available in Exkaldi, we recommend the __KenLM__ toolkit.

In [1]:
import exkaldi

import os
dataDir = os.path.join("..","examplesdata","librispeech_dummy")

First, prepare the lexicons.

We have already generated a LexiconBank object and saved it to a file (see 3_prepare_lexicons), so we restore it directly.

In [2]:
lexFile = os.path.join(dataDir, "exp", "lexicons.lex")

lexicons = exkaldi.decode.graph.load_lex(lexFile)

lexicons

<exkaldi.decode.graph.LexiconBank at 0x7f79cc9f8a90>

If you want to use the __SRILM__ backend, run the following code first to prepare it. The __KenLM__ backend needs no preparation.

In [3]:
ExkaldiInfo = exkaldi.version

ExkaldiInfo.prepare_srilm()

ExkaldiInfo.ENV["PATH"]

'/misc/home/usr18/wangyu/.virtualenvs/tfenv/bin:/home/usr18/wangyu/anaconda3/bin:/usr/local/cuda/bin:/home/usr18/wangyu/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/snap/bin:/Work18/wangyu/kaldi/src/Gambian:/Work18/wangyu/kaldi/src/bin:/Work18/wangyu/kaldi/tools/openfst:/Work18/wangyu/kaldi/tools/openfst/bin:/Work18/wangyu/kaldi/src/featbin:/Work18/wangyu/kaldi/src/GAMbian:/Work18/wangyu/kaldi/src/nnetbin:/Work18/wangyu/kaldi/src/lmbin:/Work18/wangyu/kaldi/src/fstbin:/Work18/wangyu/kaldi/src/latbin:/Work18/wangyu/kaldi/src/gmmbin:/Work18/wangyu/kaldi/tools/srilm:/Work18/wangyu/kaldi/tools/srilm/bin:/Work18/wangyu/kaldi/tools/srilm/bin/i686-m64'

Although a transcription file is already prepared in the data directory, each of its lines begins with an utterance ID that the language model does not need. Since no additional corpus is available, we strip the IDs from this transcription to produce the training text.

In [4]:
textFile = os.path.join(dataDir, "train", "text")
newTextFile = os.path.join(dataDir, "exp", "text_no_uttID")

exkaldi.utils.make_dependent_dirs(newTextFile, True)

with open(textFile, "r", encoding="utf-8") as fr:
    lines = fr.readlines()

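# Keep only the transcription, dropping the leading utterance ID.
temp = []
for line in lines:
    fields = line.strip().split(maxsplit=1)
    # Skip empty lines and lines that have an ID but no transcription.
    if len(fields) == 2:
        temp.append(fields[1])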

with open(newTextFile, "w", encoding="utf-8") as fw:
    fw.write( "\n".join(temp) )

Now we train a 3-gram model with the __KenLM__ backend. (In Exkaldi version 1.0, an error is raised if there is not enough RAM.)

In [5]:
arpaFile = os.path.join(dataDir, "exp", "3-gram.arpa")

exkaldi.lm.train_ngrams_kenlm(lexicons, order=3, textFile=newTextFile, outFile=arpaFile)

'/misc/Work19/wangyu/exkaldi-1.0/examplesdata/librispeech_dummy/exp/3-gram.arpa'
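To sanity-check a trained model, it can help to look at how many n-grams of each order it contains. The ARPA text format starts with a `\data\` header listing one `ngram N=count` line per order. The following is a minimal sketch, not part of Exkaldi: `read_arpa_counts` is a hypothetical helper name, and the sample lines are made up for illustration.

```python
import re

def read_arpa_counts(lines):
    """Return {order: count} parsed from the header of an ARPA model.

    `lines` is any iterable of text lines, e.g. an open file object.
    This is a hypothetical helper, not an Exkaldi function.
    """
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            # The first "\N-grams:" section marks the end of the header.
            break
    return counts

# Made-up ARPA fragment, used only to demonstrate the parser.
sample = """\\data\\
ngram 1=4
ngram 2=5
ngram 3=3

\\1-grams:
-0.5\t<s>\t-0.3
""".splitlines()

counts = read_arpa_counts(sample)
print(counts)
```

To inspect the model trained above, the same function could be given the opened ARPA file, for example `with open(arpaFile, encoding="utf-8") as fr: read_arpa_counts(fr)`.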

An ARPA model can be converted to binary format to speed up loading and reduce memory usage.

Although the KenLM Python API can read the ARPA format, Exkaldi expects only the KenLM binary format.

In [6]:
binaryLmFile = os.path.join(dataDir, "exp", "3-gram.binary")

exkaldi.lm.arpa_to_binary(arpaFile, binaryLmFile)

'/misc/Work19/wangyu/exkaldi-1.0/examplesdata/librispeech_dummy/exp/3-gram.binary'

Use the binary LM file to initialize a Python KenLM n-gram object.

In [7]:
model = exkaldi.lm.KenNGrams(binaryLmFile)

__KenNGrams__ is a simple wrapper around the KenLM Python model. N-grams of order up to 6 can be used.

Query it with a sentence.

In [8]:
model.score("HELLO WORLD", bos=True, eos=True)

-8.523445129394531
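The raw score is easier to interpret as a perplexity. The sketch below assumes that the returned value is the total log10 probability of the sentence, as in the KenLM Python API that KenNGrams wraps, and that the end-of-sentence token is counted when `eos=True`. `perplexity_from_log10` is a hypothetical helper, not an Exkaldi function.

```python
def perplexity_from_log10(log10_prob, num_words, eos=True):
    """Convert a total log10 sentence probability into a perplexity.

    Assumption: the score is a base-10 log probability summed over the
    words of the sentence, plus the </s> token when eos is True.
    """
    tokens = num_words + (1 if eos else 0)
    if tokens <= 0:
        raise ValueError("the sentence must contain at least one token")
    return 10.0 ** (-log10_prob / tokens)

# Score printed above for "HELLO WORLD" (2 words, scored with eos=True).
score = -8.523445129394531
ppl = perplexity_from_log10(score, num_words=2)
print(ppl)
```

A lower perplexity means the model finds the sentence less surprising. With such a small training text, high values are to be expected.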