In [1]:
!sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libboost-program-options-dev is already the newest version (1.71.0.0ubuntu2).
libboost-system-dev is already the newest version (1.71.0.0ubuntu2).
libboost-thread-dev is already the newest version (1.71.0.0ubuntu2).
libbz2-dev is already the newest version (1.0.8-2).
libboost-test-dev is already the newest version (1.71.0.0ubuntu2).
libeigen3-dev is already the newest version (3.3.7-2).
build-essential is already the newest version (12.8ubuntu1.1).
cmake is already the newest version (3.16.3-1ubuntu1.20.04.1).
liblzma-dev is already the newest version (5.2.4-1ubuntu1.1).
zlib1g-dev is already the newest version (1:1.2.11.dfsg-2ubuntu1.5).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.


With the build dependencies installed, we download and unpack the KenLM repo.

In [1]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

--2024-02-16 08:38:57--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’

-                     0%[                    ]       0  --.-KB/s               


2024-02-16 08:38:57 (1.33 MB/s) - written to stdout [491888/491888]



KenLM is written in C++, so we'll make use of `cmake` to build the binaries.

In [2]:
!mkdir -p kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

build_binary  fragment	       lmplz			     query
count_ngrams  interpolate      phrase_table_vocab	     streaming_example
filter	      kenlm_benchmark  probing_hash_table_benchmark


In [None]:
!pip install pyctcdecode
!pip install kenlm -U

Great, as we can see, the executables have been built successfully under `kenlm/build/bin/`.

KenLM by default builds an *n-gram* LM with [Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing). All text data used to create the *n-gram* is expected to be stored in a text file.
We download our dataset and save it as a `.txt` file.
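As a minimal sketch of that step, assuming the corpus is already available as an iterable of sentence strings: `lmplz` treats each line of its input as one sentence, so the helper below writes one sentence per line in UTF-8. The name `write_corpus` and the whitespace normalization are illustrative choices, not part of the original notebook.

```python
from pathlib import Path


def write_corpus(sentences, path):
    """Write one whitespace-normalized sentence per line; return the line count."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for sentence in sentences:
            cleaned = " ".join(sentence.split())  # collapse runs of whitespace
            if cleaned:  # skip empty sentences
                f.write(cleaned + "\n")
                count += 1
    return count
```

Calling `write_corpus(sentences, "dafyomi_train.txt")` would produce a file in the shape `lmplz` expects.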

Now, we just have to run KenLM's `lmplz` command to build our *n-gram*, called `"5gram.arpa"`. As is relatively common in speech recognition, we build a *5-gram* by passing the `-o 5` parameter.
For more information on the different *n-gram* LMs that can be built
with KenLM, one can take a look at the [official website of KenLM](https://kheafield.com/code/kenlm/).
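To make the meaning of the order concrete, here is a toy sketch of counting *n-grams* of a given order, with each sentence padded by `<s>` and `</s>`. It only illustrates what an *n-gram* of order `n` is; it is not KenLM's implementation, which additionally computes adjusted counts and smoothed probabilities. The function name `count_ngrams` is hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, order):
    """Count all n-grams of exactly `order` tokens, padding with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Slide a window of `order` tokens over the padded sentence
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts
```

With `order=5`, a sentence shorter than three words contributes no *5-gram* at all, which is one reason lower orders are stored alongside the highest one.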

Executing the command below might take a minute or so.

In [3]:
import os

In [4]:

!kenlm/build/bin/lmplz -o 5 <"dafyomi_train.txt" > "5gram.arpa"

=== 1/5 Counting and sorting n-grams ===
Reading /teamspace/studios/this_studio/FinalProject/word2vec-kenlm/dafyomi_train.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


****************************************************************************************************
Unigram tokens 731986 types 34680
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:416160 2:1284321408 3:2408102912 4:3852964352 5:5618906624
Statistics:
1 34680 D1=0.593097 D2=1.08412 D3+=1.51264
2 312019 D1=0.77261 D2=1.18637 D3+=1.46538
3 569477 D1=0.88998 D2=1.33943 D3+=1.46435
4 669576 D1=0.954196 D2=1.49286 D3+=1.66413
5 697205 D1=0.965374 D2=1.63683 D3+=1.65578
Memory estimate for binary LM:
type    MB
probing 48 assuming -p 1.5
probing 57 assuming -r models -p 1.5
trie    23 without quantization
trie    12 assuming -q 8 -b 8 quantization 
trie    20 assuming -a 22 array pointer compression
trie    10 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:416160 2:4992304 3:11389540 4:16069824 5:19521740
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---7

Great, we have built a *5-gram* LM! Let's inspect the first couple of lines.

In [5]:
!head -20 5gram.arpa

\data\
ngram 1=34680
ngram 2=312019
ngram 3=569477
ngram 4=669576
ngram 5=697205

\1-grams:
-5.485567	<unk>	0
0	<s>	-0.9444685
-2.1628535	</s>	0
-3.8975027	שלום	-0.14351521
-3.845749	לכולם	-0.17207906
-2.6802616	אנחנו	-0.6201084
-3.3024302	לומדים	-0.37537387
-1.9505466	את	-0.64870334
-4.087515	דף	-0.16991249
-4.4726863	כו	-0.2208741
-3.3253672	במסכת	-0.8299429
-3.799511	ראש	-0.43018007
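The listing follows the ARPA text format: a `\data\` header with one count per order, then one section per order in which each line holds a log10 probability, the words, and (except for the highest order) a backoff weight, separated by tabs. As a small sketch for reading these by program, assuming a UTF-8 file (the helper names `read_arpa_counts` and `parse_ngram_line` are illustrative):

```python
import re


def read_arpa_counts(path):
    """Return {order: count} from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
            elif in_header:
                match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
                if match:
                    counts[int(match.group(1))] = int(match.group(2))
                elif line:  # first non-empty, non-count line ends the header
                    break
    return counts


def parse_ngram_line(line):
    """Split an n-gram entry into (log10 probability, words, backoff or None)."""
    fields = line.rstrip("\n").split("\t")
    log_prob = float(fields[0])
    words = fields[1].split(" ")
    backoff = float(fields[2]) if len(fields) > 2 else None
    return log_prob, words, backoff
```

Summing the values returned by `read_arpa_counts("5gram.arpa")` gives the total number of *n-gram* entries in the file.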


There is a small problem that 🤗 Transformers will not be happy about later on.
The *5-gram* includes an "unknown" token, `<unk>`, and a *begin-of-sentence* token, `<s>`. It also needs an *end-of-sentence* token, `</s>`, among its unigrams; when the file lacks one, this currently has to be corrected after the build. Note that in the listing above an `</s>` entry (`-2.1628535	</s>	0`) is already present, so for this file the correction adds a second one.

We can add the *end-of-sentence* token by copying the `<s>` line, with `<s>` replaced by `</s>`, directly below the *begin-of-sentence* token and increasing the `ngram 1` count by 1. The file is processed line by line and has roughly 2.3 million *n-gram* entries (the sum of the header counts).

In [6]:
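# Stream the ARPA file once: bump the unigram count and insert an </s> line after <s>.
with open("5gram.arpa", "r", encoding="utf-8") as read_file, open("5gram_correct.arpa", "w", encoding="utf-8") as write_file:
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            # Header line "ngram 1=N": there is one more unigram after the insertion
            count = line.strip().split("=")[-1]
            write_file.write(line.replace(f"={count}", f"={int(count) + 1}"))
        elif not has_added_eos and "<s>" in line:
            # The first line containing <s> is its unigram entry: copy it as </s>
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)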
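The script above inserts the `</s>` line unconditionally. A guard along the following lines could be used to skip the correction when the token is already listed. This is a sketch under the assumption that the file is a well-formed ARPA file in UTF-8; the helper name `has_unigram` is hypothetical.

```python
def has_unigram(arpa_path, token):
    """Return True if `token` has an entry in the \\1-grams: section."""
    in_unigrams = False
    with open(arpa_path, encoding="utf-8") as f:
        for line in f:
            stripped = line.strip()
            if stripped == "\\1-grams:":
                in_unigrams = True
            elif in_unigrams:
                if stripped.startswith("\\"):  # next section, e.g. \2-grams:
                    return False
                fields = stripped.split("\t")
                # Unigram lines look like: log10_prob <TAB> word <TAB> backoff
                if len(fields) >= 2 and fields[1] == token:
                    return True
    return False
```

For example, running the correction only when `has_unigram("5gram.arpa", "</s>")` is false would avoid a duplicate entry.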

Let's now inspect the corrected *5-gram*.

In [7]:
!head -20 5gram_correct.arpa

\data\
ngram 1=34681
ngram 2=312019
ngram 3=569477
ngram 4=669576
ngram 5=697205

\1-grams:
-5.485567	<unk>	0
0	<s>	-0.9444685
0	</s>	-0.9444685
-2.1628535	</s>	0
-3.8975027	שלום	-0.14351521
-3.845749	לכולם	-0.17207906
-2.6802616	אנחנו	-0.6201084
-3.3024302	לומדים	-0.37537387
-1.9505466	את	-0.64870334
-4.087515	דף	-0.16991249
-4.4726863	כו	-0.2208741
-3.3253672	במסכת	-0.8299429


The unigram count has been incremented and an `</s>` line now follows `<s>`. We're done at this point, and all that is left to do is to correctly integrate the *n-gram* with [`pyctcdecode`](https://github.com/kensho-technologies/pyctcdecode) and 🤗 Transformers.

## Create the Processor

In [None]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("imvladikon/wav2vec2-xls-r-300m-hebrew")

preprocessor_config.json:   0%|          | 0.00/214 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/288 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.04k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/295 [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize


Next, we extract the vocabulary of its tokenizer as it represents the `"labels"` of `pyctcdecode`'s `BeamSearchDecoder` class.

In [None]:
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
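The order matters here: the position of each label must match the index of the corresponding logit column. A toy sketch of the same transformation on a hand-written vocabulary (the function name `labels_from_vocab` and the example tokens are made up for illustration):

```python
def labels_from_vocab(vocab_dict):
    """Return lower-cased tokens ordered by their token id."""
    ordered = sorted(vocab_dict.items(), key=lambda item: item[1])
    return [token.lower() for token, _ in ordered]


toy_vocab = {"<PAD>": 2, "a": 0, "|": 1}
labels = labels_from_vocab(toy_vocab)
```

Building a list rather than a dict keeps one label per id even if two tokens differ only in case, whereas lower-cased dict keys would collapse into a single entry.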

The `"labels"` and the previously built `5gram_correct.arpa` file are all that's needed to build the decoder.

In [None]:
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="5gram_correct.arpa",
)

Loading the LM will be faster if you build a binary file.
Reading /teamspace/studios/this_studio/FinalProject/word2vec-kenlm/5gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Found entries of length > 1 in alphabet. This is unusual unless style is BPE, but the alphabet was not recognized as BPE type. Is this correct?
Unigrams and labels don't seem to agree.


We can safely ignore the warning. All that is left to do now is to wrap the newly created `decoder`, together with the processor's `tokenizer` and `feature_extractor`, into a `Wav2Vec2ProcessorWithLM` class.

In [None]:
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)
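As a hedged sketch of how the LM-boosted processor could then be used for transcription: the helper below assumes `model` is a CTC model matching the processor (for instance one loaded with `AutoModelForCTC.from_pretrained(...)`, which is not shown in this notebook) and that `audio` is a 1-D waveform sampled at 16 kHz. The function name `transcribe` is hypothetical.

```python
def transcribe(processor_with_lm, model, audio, sampling_rate=16_000):
    """Transcribe one waveform using beam search with the n-gram LM."""
    inputs = processor_with_lm(audio, sampling_rate=sampling_rate, return_tensors="pt")
    logits = model(**inputs).logits
    # The LM processor decodes raw logits (as a NumPy array), not argmax ids
    decoded = processor_with_lm.batch_decode(logits.detach().cpu().numpy())
    return decoded.text[0]
```

Unlike a plain `Wav2Vec2Processor`, the LM processor is given the full logits, because the beam search needs the scores of all labels at every time step.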

We want to directly upload the LM-boosted processor into
the model folder of [`xls-r-300m-sv`](https://huggingface.co/hf-test/xls-r-300m-sv) to have all relevant files in one place.

Let's clone the repo, add the new decoder files and upload them afterward.
First, we need to install `git-lfs`.

In [None]:
# !sudo apt-get install git-lfs tree

Cloning and uploading of modeling files can be done conveniently with the `huggingface_hub`'s `Repository` class.

For more information on how to use `huggingface_hub` to upload files, please take a look at the [official docs](https://huggingface.co/docs/hub/how-to-upstream).

In [None]:
# from huggingface_hub import Repository

# repo = Repository(local_dir="xls-r-300m-sv", clone_from="hf-test/xls-r-300m-sv")

Having cloned `xls-r-300m-sv`, let's save the new processor with LM into it.

In [None]:
# processor_with_lm.save_pretrained("xls-r-300m-sv")

Let's inspect the local repository. The `tree` command can conveniently also show the size of the different files.

In [None]:
# !tree -h xls-r-300m-sv/
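Where `tree` is not installed, a rough Python equivalent can list file sizes instead. This is only a sketch; the helper names `human_size` and `list_sizes` are illustrative, and the units are powers of 1024.

```python
from pathlib import Path


def human_size(num_bytes):
    """Format a byte count with a B/K/M/G/T suffix."""
    for unit in ("B", "K", "M", "G"):
        if num_bytes < 1024:
            return f"{num_bytes:.1f}{unit}"
        num_bytes /= 1024
    return f"{num_bytes:.1f}T"


def list_sizes(root):
    """Map each file under `root` (relative POSIX path) to its formatted size."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): human_size(p.stat().st_size)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Calling `list_sizes("xls-r-300m-sv")` would then show how much of the repository is taken up by the files under `language_model/`.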

For `xls-r-300m-sv`, the *5-gram* LM is quite large: it amounts to more than 4 GB.
To reduce the size of the *n-gram* and make loading faster, KenLM allows converting `.arpa` files to binary ones using the `build_binary` executable.

Let's make use of it here.

In [None]:
# !kenlm/build/bin/build_binary xls-r-300m-sv/language_model/5gram_correct.arpa xls-r-300m-sv/language_model/5gram.bin

Great, it worked! Let's remove the `.arpa` file and check the size of the binary *5-gram* LM.

In [None]:
# !rm xls-r-300m-sv/language_model/5gram_correct.arpa && tree -h xls-r-300m-sv/

Nice, we reduced the *n-gram* by more than half, to less than 2 GB. In the final step, let's upload all files.

In [None]:
# repo.push_to_hub(commit_message="Upload lm-boosted decoder")

In [None]:
# Save the LM-boosted processor to a local directory
processor_with_lm.save_pretrained("/teamspace/studios/this_studio/FinalProject/models/KenLM-Wav2Vec2-Hebrew")