# Welcome to Exkaldi

In this section, we will train an n-gram language model and query it.

Although __SRILM__ is available in exkaldi, we recommend the __KenLM__ toolkit.

In [1]:
import exkaldi

import os
dataDir = "librispeech_dummy"

First, prepare the lexicons. We have already generated a __LexiconBank__ object and saved it to a file (see 3_prepare_lexicons), so we restore it directly.

In [2]:
lexFile = os.path.join(dataDir, "exp", "lexicons.lex")

lexicons = exkaldi.decode.graph.load_lex(lexFile)

lexicons

<exkaldi.decode.graph.LexiconBank at 0x7f2175144278>

We will use the training text corpus to train the language model. Although a transcription file has already been prepared in the data directory, each line starts with an utterance ID that we do not need, so we do a little work to produce a new text file without it.

In [3]:
textFile = os.path.join(dataDir, "train", "text")
newTextFile = os.path.join(dataDir, "exp", "train_lm_text")

exkaldi.utils.make_dependent_dirs(newTextFile, True)

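# Read the Kaldi-style transcription: each line is "<utt-ID> <text>".
with open(textFile, "r", encoding="utf-8") as fr:
    lines = fr.readlines()

# Drop the utterance ID and keep only the text; skip lines that have no text.
temp = []
for line in lines:
    line = line.strip().split(maxsplit=1)
    if len(line) < 2:
        continue
    temp.append(line[1])

# Write one sentence per line for LM training.
with open(newTextFile, "w", encoding="utf-8") as fw:
    fw.write("\n".join(temp))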
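To see what the cell above does to individual lines, here is a minimal standalone sketch. The helper name `strip_utt_id` and the sample sentences are made up for illustration; only the `<utt-ID> <text>` layout comes from the transcription file.

```python
def strip_utt_id(line):
    """Return the text part of a '<utt-ID> <text>' line, or None if there is no text."""
    parts = line.strip().split(maxsplit=1)
    return parts[1] if len(parts) == 2 else None

# Hypothetical sample lines in the Kaldi transcription layout.
sample = [
    "1272-128104-0000 MISTER QUILTER IS THE APOSTLE\n",
    "1272-128104-0001\n",  # utterance ID only, no text: skipped
    "\n",                  # empty line: skipped
]

cleaned = [text for text in map(strip_utt_id, sample) if text is not None]
print(cleaned)
```

Splitting with `maxsplit=1` keeps the spaces inside the sentence untouched, which is why only the first field is removed.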

Now we train a 2-gram model with the __KenLM__ backend. (The __SRILM__ backend is also available.)

In [4]:
arpaFile = os.path.join(dataDir, "exp", "2-gram.arpa")

exkaldi.lm.train_ngrams_kenlm(lexicons, order=2, textFile=newTextFile, outFile=arpaFile, config={"-S":"20%"})

'/misc/Work19/wangyu/exkaldi-1.2/tutorials/librispeech_dummy/exp/2-gram.arpa'
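An ARPA file is plain text, and it begins with a `\data\` header that lists how many n-grams of each order the model contains. As a quick sanity check, the sketch below parses such a header. The function name `read_arpa_counts` and the sample header are hypothetical and not part of exkaldi; the counts are invented.

```python
import re

def read_arpa_counts(arpa_text):
    """Return {order: count} parsed from the \\data\\ header of ARPA-format text."""
    counts = {}
    in_header = False
    for line in arpa_text.splitlines():
        line = line.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if in_header:
            match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                break  # reached the first n-gram section, e.g. \1-grams:
    return counts

# Invented header for illustration; a real file would be read from arpaFile.
sample_header = "\\data\\\nngram 1=5\nngram 2=9\n\n\\1-grams:\n"
arpa_counts = read_arpa_counts(sample_header)
print(arpa_counts)
```

For a model trained with `order=2`, we would expect exactly two entries in the result, one for unigrams and one for bigrams.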

An ARPA model can be transformed to binary format to speed up loading and reduce memory cost.  
Although the __KenLM__ Python API supports reading the ARPA format, exkaldi expects only the KenLM binary format.

In [5]:
binaryLmFile = os.path.join(dataDir, "exp", "2-gram.binary")

exkaldi.lm.arpa_to_binary(arpaFile, binaryLmFile)

'/misc/Work19/wangyu/exkaldi-1.2/tutorials/librispeech_dummy/exp/2-gram.binary'

Use the binary LM file to initialize a Python KenLM n-gram object.

In [6]:
model = exkaldi.lm.KenNGrams(binaryLmFile)

model

<exkaldi.lm.lm.KenNGrams at 0x7f2174ece160>

__KenNGrams__ is a simple wrapper of the KenLM Python model. By default, n-grams up to order 6 can be used. If you want to train higher-order n-grams, change the configuration when you install the exkaldi PyPI package from our GitHub repository.

You can query this model with a sentence.

In [7]:
model.score("HELLO WORLD", bos=True, eos=True)

-8.531333923339844

Or compute the perplexity to evaluate a language model.

In [8]:
evalTrans = exkaldi.load_trans(os.path.join(dataDir, "test", "text"))

score = model.perplexity(evalTrans)
score

{'1272-128104-0000': 389.82728639223217,
 '1272-128104-0001': 1259.8703186980435,
 '1272-128104-0002': 634.7298958519973,
 '1272-128104-0003': 700.4998070862594,
 '1272-128104-0004': 640.942524338446,
 '1272-135031-0000': 884.1503819354043,
 '1272-135031-0001': 398.2710010651754,
 '1272-135031-0002': 237.274884147695,
 '1272-135031-0003': 246.7711624456322,
 '1272-135031-0004': 445.0821986932373,
 '1272-141231-0000': 242.7351101272028,
 '1272-141231-0001': 632.2589965581269,
 '1272-141231-0002': 563.2843089015221,
 '1272-141231-0003': 533.2710688117827,
 '1272-141231-0004': 370.62070021926473,
 '1462-170138-0000': 467.84169223100076,
 '1462-170138-0001': 598.6470090870347,
 '1462-170138-0002': 722.879667569615,
 '1462-170138-0003': 1245.2999525714333,
 '1462-170138-0004': 116.89921221929579}
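The sentence score and the perplexity are closely related. KenLM reports scores as log10 probabilities, and perplexity is the inverse probability normalized by the number of predicted tokens. The sketch below assumes the convention used by the KenLM Python module, where the end-of-sentence symbol counts as one extra token; whether exkaldi normalizes in exactly the same way is an assumption here. The helper name `perplexity_from_log10` is hypothetical.

```python
def perplexity_from_log10(log10_prob, num_words, eos=True):
    """Convert a log10 sentence probability into a per-token perplexity."""
    # Assumption: </s> counts as one predicted token when eos is True.
    tokens = num_words + (1 if eos else 0)
    return 10.0 ** (-log10_prob / tokens)

# Score of "HELLO WORLD" from the query above: two words plus </s>.
ppl = perplexity_from_log10(-8.531333923339844, num_words=2)
print(ppl)
```

A lower perplexity means the model finds the sentence less surprising, so the values above can be compared across utterances regardless of their length.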

In [9]:
type(score)

exkaldi.core.achivements.Metric

___score___ is an exkaldi __Metric__ object (a subclass of Python dict). We designed a group of classes to hold Kaldi text-format tables and exkaldi's own text-format data.

__ListTable__: spk2utt, utt2spk, words, phones and so on.  
__Transcription__: transcription corpus, n-best decoding result and so on.  
__Metric__: AM score, LM score, LM perplexity, sentence lengths and so on.  
__BytesIndexTable__: the memory index of a bytes data.  

All these classes are subclasses of Python dict. They share some common methods and attributes, and each also has its own. For example, we can compute the average value of a __Metric__.

In [10]:
score.mean()

566.5578589475201
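Because these classes are dict subclasses, ordinary dict operations apply to them as well. The sketch below uses a plain dict in place of __Metric__ (it does not show how exkaldi implements `mean`) and recomputes the average from the perplexities printed earlier, then finds the utterance with the highest perplexity.

```python
# Perplexities copied from the output above; a plain dict stands in for Metric.
scores = {
    "1272-128104-0000": 389.82728639223217,
    "1272-128104-0001": 1259.8703186980435,
    "1272-128104-0002": 634.7298958519973,
    "1272-128104-0003": 700.4998070862594,
    "1272-128104-0004": 640.942524338446,
    "1272-135031-0000": 884.1503819354043,
    "1272-135031-0001": 398.2710010651754,
    "1272-135031-0002": 237.274884147695,
    "1272-135031-0003": 246.7711624456322,
    "1272-135031-0004": 445.0821986932373,
    "1272-141231-0000": 242.7351101272028,
    "1272-141231-0001": 632.2589965581269,
    "1272-141231-0002": 563.2843089015221,
    "1272-141231-0003": 533.2710688117827,
    "1272-141231-0004": 370.62070021926473,
    "1462-170138-0000": 467.84169223100076,
    "1462-170138-0001": 598.6470090870347,
    "1462-170138-0002": 722.879667569615,
    "1462-170138-0003": 1245.2999525714333,
    "1462-170138-0004": 116.89921221929579,
}

# Arithmetic mean over all utterances.
mean_ppl = sum(scores.values()) / len(scores)

# Utterance the model finds hardest to predict.
worst_utt = max(scores, key=scores.get)

print(mean_ppl, worst_utt)
```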

Finally, we generate the grammar FST (G.fst) for further steps.

In [11]:
Gfile = os.path.join(dataDir, "exp", "G.fst")

exkaldi.decode.graph.make_G(lexicons, arpaFile, outFile=Gfile, order=2)

'/misc/Work19/wangyu/exkaldi-1.2/tutorials/librispeech_dummy/exp/G.fst'