## Step 0: Prerequisite

To run this notebook, you need to build the decoder binaries and runtime first. Please refer to [README.md](../LanguageModelDecoder/README.md) for more details.

You will need at least **230 GB** of free disk space and **100 GB** of RAM to run this notebook.
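A quick way to check these requirements before starting is sketched below. It is a minimal sketch, assuming a Linux machine (the `sysconf` names used are not available everywhere); the helper name `check_resources` is hypothetical. It reports total physical RAM rather than currently free RAM, and it uses binary GiB, so treat the result as a rough guide.

```python
import os
import shutil

GIB = 1024 ** 3


def check_resources(path=".", min_disk_gib=230, min_ram_gib=100):
    """Return (free_disk_gib, total_ram_gib, ok) for the filesystem holding `path`."""
    free_disk_gib = shutil.disk_usage(path).free / GIB
    # Total physical memory; these sysconf names are available on Linux.
    total_ram_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    ok = free_disk_gib >= min_disk_gib and total_ram_gib >= min_ram_gib
    return free_disk_gib, total_ram_gib, ok


free_disk, total_ram, ok = check_resources(".")
print(f"free disk: {free_disk:.0f} GiB, total RAM: {total_ram:.0f} GiB, sufficient: {ok}")
```

Pass the directory that will hold the corpus and model files as `path`, since scratch and home directories often live on different filesystems.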


## Step 1: Prepare the language model training corpus

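The training corpus should be a text file with one sentence per line. [OpenWebText2](https://openwebtext2.readthedocs.io/en/latest/) is one suitable source; the cells below load the `generics_kb_best` configuration of the `generics_kb` dataset instead, while the later steps keep the file name `openwebtext2.txt`.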


In [1]:
%%sh

CORPUS_DIR=/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_corpus


Now we load all sentences and write them into one big file, one JSON record per line with the sentence stored in the `text` field.
Make sure you have the Python libraries `datasets` and `tqdm` installed.

In [2]:

import datetime
import json
import os
from typing import cast

from datasets import load_dataset
from tqdm.notebook import tqdm


def json_serial(obj):
    """JSON serializer for objects not serializable by the default json code."""
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError("Type %s not serializable" % type(obj))


# Must match the --input_text path passed to format_lm_data.py in Step 3.
merged_text_path = '/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_corpus/openwebtext2.txt'
cache_dir = '/hpi/fs00/scratch/leon.hermann/b2t/cache'

# Load the "generic_sentence" column of the "generics_kb_best" configuration.
sentences = cast(
    list[str],
    load_dataset(
        "generics_kb",
        "generics_kb_best",
        cache_dir=cache_dir,
        data_dir=os.path.join(cache_dir, "generics_kb_best", "data"),
        split="train",
    )["generic_sentence"],  # type: ignore
)


def write_sentences_to_file(sentences, filename):
    """Write one JSON object per line: {"text": <sentence>, "meta": null}."""
    with open(filename, 'w') as file:
        for sentence in tqdm(sentences):
            file.write(json.dumps({"text": sentence, "meta": None}, default=json_serial) + "\n")


write_sentences_to_file(sentences, merged_text_path)
print(f"Sentences have been written to '{merged_text_path}'")

100%|██████████| 1020868/1020868 [00:07<00:00, 130664.62it/s]

Sentences have been written to '/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_corpus/openwebtext2.txt'
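Before moving on, it can help to read back the first few records and confirm the file has the expected shape. The sketch below assumes the one-JSON-object-per-line format written above; `peek_jsonl` is a hypothetical helper, and the demo runs on a throwaway file so it does not depend on the real corpus path.

```python
import json
import os
import tempfile
from itertools import islice


def peek_jsonl(path, n=3):
    """Return the "text" field of the first n JSON-lines records in `path`."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in islice(f, n)]


# Demo on a temporary file written in the same format as the cell above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        for s in ["Dogs bark.", "Birds fly."]:
            f.write(json.dumps({"text": s, "meta": None}) + "\n")
    preview = peek_jsonl(demo_path, n=2)

print(preview)
```

On the real corpus, `peek_jsonl(merged_text_path)` would show the first three sentences.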





## Step 2: Download CMU dictionary

In [3]:
%%bash

wget https://github.com/Alexir/CMUdict/raw/master/cmudict-0.7b -O /hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_corpus/cmudict.txt

--2024-03-13 14:18:59--  https://github.com/Alexir/CMUdict/raw/master/cmudict-0.7b
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Alexir/CMUdict/master/cmudict-0.7b [following]
--2024-03-13 14:18:59--  https://raw.githubusercontent.com/Alexir/CMUdict/master/cmudict-0.7b
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3865710 (3.7M) [text/plain]
Saving to: ‘/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_corpus/cmudict.txt’

     0K .......... .......... .......... .......... ..........  1% 2.49M 1s
    50K .......... .......... .......... .......... ..........  2% 8.97M 1s
   100K .......... ..
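To sanity-check the downloaded dictionary, its entries can be parsed into word and phoneme lists. This is a minimal sketch, assuming the usual CMUdict layout: comment lines start with `;;;`, each entry is a word followed by its phonemes, and alternate pronunciations carry a `(1)`, `(2)`, ... suffix. The function name `parse_cmudict_line` and the sample lines are illustrative.

```python
import re


def parse_cmudict_line(line):
    """Parse one CMUdict line into (word, phonemes); return None for comments or blanks."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phones = line.split()
    # Alternate pronunciations are marked as WORD(1), WORD(2), ...; drop the marker.
    word = re.sub(r"\(\d+\)$", "", word)
    return word, phones


# Sample lines in the CMUdict format.
first = parse_cmudict_line("HELLO  HH AH0 L OW1")
variant = parse_cmudict_line("HELLO(1)  HH EH0 L OW1")
comment = parse_cmudict_line(";;; a comment line")
print(first, variant, comment)
```

When reading the real file, an explicit encoding such as `latin-1` may be needed if the default UTF-8 decoding fails; that is an assumption to verify against the downloaded file.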

## Step 3: Build language model

Build a 3-gram language model from the corpus prepared in Step 1.
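As a reminder of what a 3-gram model estimates, the toy sketch below counts trigrams and computes an unsmoothed maximum-likelihood probability. It is only an illustration, not what `build_lm.sh` does internally: the real build applies discounting and pruning, and the sentence-boundary padding used here is just one common convention. The function names are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, order=3):
    """Count n-grams of the given order, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (order - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


def trigram_mle(counts, context, word):
    """Maximum-likelihood P(word | context) from raw trigram counts (no smoothing)."""
    total = sum(c for ngram, c in counts.items() if ngram[:-1] == context)
    return counts[context + (word,)] / total if total else 0.0


toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
toy_counts = count_ngrams(toy_corpus)
p_sat = trigram_mle(toy_counts, ("the", "cat"), "sat")
print(p_sat)  # 0.5: "the cat" is followed once by "sat" and once by "ran"
```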

In [4]:
%%bash

set -xe

LM_ROOT=../LanguageModelDecoder/examples/speech/s0/
LM_CORPUS_DIR=/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_corpus
LM_MODEL_DIR=/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model

cd $LM_ROOT
echo $PWD
. path.sh

# First step is formatting the text corpus.
mkdir -p $LM_MODEL_DIR/data/local/lm_data
python local/format_lm_data.py \
    --input_text $LM_CORPUS_DIR/openwebtext2.txt \
    --output_text $LM_MODEL_DIR/data/local/lm_data/corpus.txt \
    --dict $LM_CORPUS_DIR/cmudict.txt \
    --unk

# Build the LM
dict_type=phn
lm_order=3
prune_threshold=1e-9
local/build_lm.sh \
    $LM_MODEL_DIR/data/local/lm_data/corpus.txt \
    $LM_MODEL_DIR/data/local/lm \
    $dict_type \
    $lm_order \
    $prune_threshold \
    $LM_CORPUS_DIR/cmudict.txt

# Optionally, if you have 1TB of RAM, you can build a 5-gram LM
#dict_type=phn
#lm_order=5
#prune_threshold=4e-11
#local/build_lm.sh \
#    $LM_MODEL_DIR/data/local/lm_data/corpus.txt \
#    $LM_MODEL_DIR/data/local/lm \
#    $dict_type \
#    $lm_order \
#    $prune_threshold \
#    $LM_CORPUS_DIR/cmudict.txt

+ LM_ROOT=../LanguageModelDecoder/examples/speech/s0/
+ LM_CORPUS_DIR=/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_corpus
+ LM_MODEL_DIR=/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model
+ cd ../LanguageModelDecoder/examples/speech/s0/
+ echo /hpi/fs00/home/leon.hermann/brain2text/speechBCI/LanguageModelDecoder/examples/speech/s0
+ . path.sh


/hpi/fs00/home/leon.hermann/brain2text/speechBCI/LanguageModelDecoder/examples/speech/s0


++ export WENET_DIR=/hpi/fs00/home/leon.hermann/brain2text/speechBCI/LanguageModelDecoder/examples/speech/s0/../../..
++ WENET_DIR=/hpi/fs00/home/leon.hermann/brain2text/speechBCI/LanguageModelDecoder/examples/speech/s0/../../..
++ export BUILD_DIR=/hpi/fs00/home/leon.hermann/brain2text/speechBCI/LanguageModelDecoder/examples/speech/s0/../../../runtime/server/x86/build
++ BUILD_DIR=/hpi/fs00/home/leon.hermann/brain2text/speechBCI/LanguageModelDecoder/examples/speech/s0/../../../runtime/server/x86/build
++ export OPENFST_PREFIX_DIR=/hpi/fs00/home/leon.hermann/brain2text/speechBCI/LanguageModelDecoder/examples/speech/s0/../../../runtime/server/x86/build/../fc_base/openfst-subbuild/openfst-populate-prefix
++ OPENFST_PREFIX_DIR=/hpi/fs00/home/leon.hermann/brain2text/speechBCI/LanguageModelDecoder/examples/speech/s0/../../../runtime/server/x86/build/../fc_base/openfst-subbuild/openfst-populate-prefix
++ export PATH=/hpi/fs00/home/leon.hermann/brain2text/speechBCI/LanguageModelDecoder/exampl

[progress output: 10000, 20000, ..., 1020000 in steps of 10000]


+ dict_type=phn
+ lm_order=3
+ prune_threshold=1e-9
+ local/build_lm.sh /hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/local/lm_data/corpus.txt /hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/local/lm phn 3 1e-9 /hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_corpus/cmudict.txt


/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/local/lm


/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/local/lm_data/corpus.txt: line 2011970: 2011905 sentences, 11315212 words, 407976 OOVs
0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
using GoodTuring for 1-grams
Good-Turing discounting 1-grams
GT-count [0] = 0
GT-count [1] = 7473
GT-count [2] = 3879
using GoodTuring for 2-grams
Good-Turing discounting 2-grams
GT-count [0] = 0
GT-count [1] = 725223
GT-count [2] = 178938
GT-count [3] = 79277
GT-count [4] = 49006
GT-count [5] = 31889
GT-count [6] = 23070
GT-count [7] = 17471
GT-count [8] = 13809
using GoodTuring for 3-grams
Good-Turing discounting 3-grams
GT-count [0] = 0
GT-count [1] = 2802009
GT-count [2] = 441695
GT-count [3] = 150385
GT-count [4] = 85115
GT-count [5] = 48133
GT-count [6] = 32625
GT-count [7] = 23154
GT-count [8] = 17428
discarded 1 2-gram contexts containing pseudo-events
discarded 24332 3-gram contexts containing pseudo-events
writing 134376 1-grams
writing 1240158 2-grams
writing 3710847 3-grams


Prune LM with threshold 1e-9
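The `GT-count` lines in the log above can be read as counts of counts. The sketch below assumes that `GT-count [r]` is the number of distinct n-grams seen exactly r times, and applies the textbook Good-Turing adjusted count r* = (r + 1) · n_{r+1} / n_r to the 3-gram values from the log. The toolkit's actual discounting may include further normalization, so the numbers are illustrative only.

```python
def good_turing_adjusted_count(r, n_r, n_r_plus_1):
    """Textbook Good-Turing estimate: r* = (r + 1) * n_{r+1} / n_r."""
    return (r + 1) * n_r_plus_1 / n_r


# Counts of counts for 3-grams, copied from the log output above.
gt_counts_3gram = {1: 2802009, 2: 441695, 3: 150385, 4: 85115}

adjusted = {
    r: good_turing_adjusted_count(r, gt_counts_3gram[r], gt_counts_3gram[r + 1])
    for r in (1, 2, 3)
}
for r, r_star in adjusted.items():
    print(f"r={r}: adjusted count {r_star:.3f}")
```

Under this reading, a trigram seen once is treated as if it had been seen roughly 0.32 times, which frees probability mass for unseen trigrams.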




## Step 4: Build WFST decoder graph

Convert the 3-gram language model from Step 3 into a WFST (weighted finite-state transducer) decoding graph.
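The decoding graph is built as a composition of three transducers named T, L and G in the scripts below. Assuming the usual convention for CTC-based decoders, G is the n-gram grammar over words, L is the lexicon mapping phonemes to words, and T is the token transducer that maps frame-level CTC labels to phonemes by merging repeats and removing blanks. The sketch below shows that last mapping in plain Python; the blank symbol name `<blank>` and the function name are assumptions for illustration.

```python
from itertools import groupby


def ctc_collapse(frame_labels, blank="<blank>"):
    """Merge consecutive repeated labels, then drop blank labels."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


frames = ["<blank>", "HH", "HH", "<blank>", "AH", "AH", "<blank>",
          "L", "L", "<blank>", "L", "OW"]
collapsed = ctc_collapse(frames)
print(collapsed)  # ['HH', 'AH', 'L', 'L', 'OW']
```

Note that the blank between the two `L` runs is what allows a genuinely repeated phoneme to survive the merge.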

In [6]:
%%bash

LM_ROOT=../LanguageModelDecoder/examples/speech/s0/
LM_MODEL_DIR=/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model
use_all_phones=0
dict_type=phn
sil_prob=0.9

cd $LM_ROOT
. path.sh

# Prepare L.fst
local/prepare_dict_ctc.sh $LM_MODEL_DIR/data/local/lm $LM_MODEL_DIR/data/local/dict_phn $use_all_phones
tools/fst/ctc_compile_dict_token.sh --dict-type $dict_type --sil-prob $sil_prob \
    $LM_MODEL_DIR/data/local/dict_phn $LM_MODEL_DIR/data/local/lang_phn_tmp $LM_MODEL_DIR/data/lang_phn

# Build TLG decoding graph
tools/fst/make_tlg.sh $LM_MODEL_DIR/data/local/lm $LM_MODEL_DIR/data/lang_phn $LM_MODEL_DIR/data/lang_test

/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/local/dict_phn


lexicon_raw_nosil done
units_nosil.txt done
lexicon_numbers.txt done
dict_type: phn
space_char: >
sil_prob: 0.9


fstaddselfloops 'echo 42 |' 'echo 134374 |' 


Dict and token FSTs compiling succeeded


arpa2fst --read-symbol-table=/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/lang_test/words.txt --keep-symbols=true - 
I0313 14:23:01.743629 1067173 arpa-file-parser.cc:93] Reading \data\ section.
I0313 14:23:01.747021 1067173 arpa-file-parser.cc:148] Reading \1-grams: section.
I0313 14:23:02.750360 1067173 arpa-file-parser.cc:148] Reading \2-grams: section.
I0313 14:23:13.359136 1067173 arpa-file-parser.cc:148] Reading \3-grams: section.


Checking how stochastic G is (the first of these numbers should be small):


fstisstochastic /hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/lang_test/G.fst 


2.7294 -10.8701


fstdeterminizestar --use-log=true 
fstminimizeencoded 
fsttablecompose /hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/lang_test/L.fst /hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/lang_test/G.fst 
fsttablecompose /hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/lang_test/T.fst /hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/lang_test/LG.fst 


Composing decoding graph TLG.fst succeeded


In [None]:
import pynini

# Paths to the grammar FST (G) and the composed lexicon-grammar FST (LG) from Step 4.
g_fst_path = "/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/lang_test/G.fst"
lg_fst_path = "/hpi/fs00/scratch/leon.hermann/b2t/speechBCI/lm_model/data/lang_test/LG.fst"

g_fst = pynini.Fst.read(g_fst_path)
lg_fst = pynini.Fst.read(lg_fst_path)
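Once the FSTs are loaded, a first sanity check is to compare their sizes. The helper below is a small sketch that relies only on the `states()` and `num_arcs(state)` methods of the FST object; `fst_size` is a hypothetical name, and iterating over every state can take a while on large graphs.

```python
def fst_size(fst):
    """Return (number of states, number of arcs) by walking all states of `fst`."""
    num_states = 0
    num_arcs = 0
    for state in fst.states():
        num_states += 1
        num_arcs += fst.num_arcs(state)
    return num_states, num_arcs


# Intended usage in this notebook (requires the FSTs loaded in the cell above):
#   print("G :", fst_size(g_fst))
#   print("LG:", fst_size(lg_fst))
```

LG would normally be expected to be larger than G, since every word arc is expanded into its phoneme sequence.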