# Train n-gram language model with KenLM on Colab

See https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Boosting_Wav2Vec2_with_n_grams_in_Transformers.ipynb#scrollTo=X9qg4FPt2zi8 for a detailed explanation of how to use KenLM to boost fine-tuned wav2vec2 models on Hugging Face 🤗.

Install KenLM

In [31]:
!sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
libboost-program-options-dev is already the newest version (1.74.0.3ubuntu7).
libboost-system-dev is already the newest version (1.74.0.3ubuntu7).
libboost-thread-dev is already the newest version (1.74.0.3ubuntu7).
libbz2-dev is already the newest version (1.0.8-5build1).
liblzma-dev is already the newest version (5.2.5-2ubuntu1).
libboost-test-dev is already the newest version (1.74.0.3ubuntu7).
libeigen3-dev is already the newest version (3.4.0-2ubuntu2).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
zlib1g-dev is already the newest version (1:1.2.11.dfsg-2ubuntu9.2).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [32]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

--2024-11-26 14:02:15--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 129.80.89.152, 2603:c020:4009:8710:ca:11:17:0
Connecting to kheafield.com (kheafield.com)|129.80.89.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/octet-stream]
Saving to: ‘STDOUT’


2024-11-26 14:02:16 (1.93 MB/s) - written to stdout [491888/491888]



In [33]:
!mkdir -p kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

build_binary  fragment	       lmplz			     query
count_ngrams  interpolate      phrase_table_vocab	     streaming_example
filter	      kenlm_benchmark  probing_hash_table_benchmark


Install 🤗 dependencies

In [34]:
!pip install datasets transformers



Load preprocessed dataset from 🤗 and write it to file as required by KenLM

In [35]:
from datasets import load_dataset

# change to your dataset path
username = "hf-test"
target_lang = "sv"

dataset = load_dataset(f"{username}/{target_lang}_corpora_parliament_processed", split="train")

# Write the whole corpus to a single UTF-8 text file for lmplz
with open("text.txt", "w", encoding="utf-8") as file:
  file.write(" ".join(dataset["text"]))

Repo card metadata block was not found. Setting CardData to empty.


Train 5-gram language model

In [36]:
!kenlm/build/bin/lmplz -o 5 <"text.txt" > "5gram.arpa"

=== 1/5 Counting and sorting n-grams ===
Reading /content/text.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 42153890 types 360209
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:4322508 2:1061777088 3:1990832128 4:3185331200 5:4645275136
Statistics:
1 360208 D1=0.686222 D2=1.01595 D3+=1.33685
2 5476741 D1=0.761523 D2=1.06735 D3+=1.32559
3 18177681 D1=0.839918 D2=1.12061 D3+=1.33794
4 30374983 D1=0.909146 D2=1.20496 D3+=1.37235
5 37231651 D1=0.944104 D2=1.25164 D3+=1.344
Memory estimate for binary LM:
type      MB
probing 1884 assuming -p 1.5
probing 2195 assuming -r models -p 1.5
trie     922 without quantization
trie     518 assuming -q 8 -b 8 quantization 
trie     806 assuming -a 22 array pointer compression
trie     401 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
===

Check head of file

In [37]:
!head -20 5gram.arpa

\data\
ngram 1=360208
ngram 2=5476741
ngram 3=18177681
ngram 4=30374983
ngram 5=37231651

\1-grams:
-6.770219	<unk>	0
0	<s>	-0.11831701
-4.6095004	återupptagande	-1.2174699
-2.2361007	av	-0.79668784
-4.8163533	sessionen	-0.37327805
-2.2251768	jag	-1.4205662
-4.181505	förklarar	-0.56261665
-3.5790775	europaparlamentets	-0.63611007
-4.771945	session	-0.3647111
-5.8043895	återupptagen	-0.3058712
-2.8580177	efter	-0.7557702
-5.199537	avbrottet	-0.43322718
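
The `\data\` header above lists how many n-grams of each order the model holds. A minimal sketch for reading those counts programmatically, assuming a standard ARPA layout; the function name `read_arpa_counts` is hypothetical and not part of KenLM:

```python
import re

def read_arpa_counts(path):
    """Return {order: count} parsed from the header of an ARPA file."""
    counts = {}
    in_header = False
    with open(path, "r", encoding="utf-8") as arpa:
        for line in arpa:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
            elif in_header:
                match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
                if match:
                    counts[int(match.group(1))] = int(match.group(2))
                elif line:
                    # First non-empty line that is not a count (e.g. \1-grams:)
                    # marks the end of the header.
                    break
    return counts
```

Calling `read_arpa_counts("5gram.arpa")` should return one entry per n-gram order, matching the `ngram N=...` lines printed by `head`.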


Add end-of-sentence token "</s>"

In [38]:
# Read the ARPA file written by lmplz and write a corrected copy that also contains "</s>"
with open("5gram.arpa", "r") as read_file, open("5gram_sv_lm.arpa", "w") as write_file:
  has_added_eos = False
  for line in read_file:
    if not has_added_eos and "ngram 1=" in line:
      count=line.strip().split("=")[-1]
      write_file.write(line.replace(f"{count}", f"{int(count)+1}"))
    elif not has_added_eos and "<s>" in line:
      write_file.write(line)
      write_file.write(line.replace("<s>", "</s>"))
      has_added_eos = True
    else:
      write_file.write(line)
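
After patching, it can be worth confirming that the declared unigram count still matches the number of unigram entries and that `</s>` is really present. A minimal sketch under the assumption of a tab-separated ARPA file; `unigram_summary` is a hypothetical helper, not a KenLM function:

```python
def unigram_summary(path):
    """Return (declared_count, actual_count, special_tokens) for the 1-gram section."""
    declared = None
    actual = 0
    special = set()
    in_unigrams = False
    with open(path, "r", encoding="utf-8") as arpa:
        for line in arpa:
            stripped = line.strip()
            if stripped.startswith("ngram 1="):
                declared = int(stripped.split("=")[-1])
            elif stripped == "\\1-grams:":
                in_unigrams = True
            elif in_unigrams:
                if stripped.startswith("\\"):
                    break  # next section header (\2-grams: or \end\)
                if stripped:
                    actual += 1
                    fields = stripped.split("\t")
                    if len(fields) > 1 and fields[1] in ("<unk>", "<s>", "</s>"):
                        special.add(fields[1])
    return declared, actual, special
```

For a consistent file, the first two values are equal and the returned set contains `"</s>"`. The function scans only the header and the unigram section, so it stops long before the end of a large model.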

Check head of file

In [39]:
!head -20 5gram_sv_lm.arpa

\data\
ngram 1=360209
ngram 2=5476741
ngram 3=18177681
ngram 4=30374983
ngram 5=37231651

\1-grams:
-6.770219	<unk>	0
0	<s>	-0.11831701
0	</s>	-0.11831701
-4.6095004	återupptagande	-1.2174699
-2.2361007	av	-0.79668784
-4.8163533	sessionen	-0.37327805
-2.2251768	jag	-1.4205662
-4.181505	förklarar	-0.56261665
-3.5790775	europaparlamentets	-0.63611007
-4.771945	session	-0.3647111
-5.8043895	återupptagen	-0.3058712
-2.8580177	efter	-0.7557702


Compress the ARPA file by converting it to KenLM's binary format

In [40]:
!kenlm/build/bin/build_binary 5gram_sv_lm.arpa 5gram_sv_lm.bin

Reading 5gram_sv_lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


Download file to local machine (use Chrome if it fails on another browser).

In [41]:
from google.colab import files
files.download("5gram_sv_lm.bin")


Train a 3-gram language model on a French Wikipedia text file

In [45]:
!kenlm/build/bin/lmplz -o 3 < wikipedia.fr > wikipedia.arpa


=== 1/5 Counting and sorting n-grams ===
Reading /content/wikipedia.fr
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 17190965 types 312291
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:3747492 2:3785666560 3:7098124800
Statistics:
1 312291 D1=0.678269 D2=1.00641 D3+=1.35207
2 3320716 D1=0.770153 D2=1.07759 D3+=1.35281
3 8731639 D1=0.828302 D2=1.15594 D3+=1.33959
Memory estimate for binary LM:
type     MB
probing 233 assuming -p 1.5
probing 253 assuming -r models -p 1.5
trie    101 without quantization
trie     58 assuming -q 8 -b 8 quantization 
trie     94 assuming -a 22 array pointer compression
trie     52 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:3747492 2:53131456 3:174632780
----5---10---15---2

In [46]:
!kenlm/build/bin/build_binary wikipedia.arpa wikipedia.binary

Reading wikipedia.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


In [47]:
!kenlm/build/bin/query wikipedia.binary


This binary file contains probing hash tables.
^C
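
The cell above was interrupted with `^C`, most likely because `query` reads its sentences from standard input and none were supplied. A minimal sketch that feeds sentences through stdin from Python instead; the helper name `run_query` is hypothetical, and only the `query <model>` invocation shown above is assumed:

```python
import subprocess

def run_query(query_bin, model_path, sentences):
    """Send one sentence per line to the query tool via stdin and return its stdout."""
    result = subprocess.run(
        [query_bin, model_path],
        input="\n".join(sentences) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if the tool exits with a non-zero status
    )
    return result.stdout
```

For example, `print(run_query("kenlm/build/bin/query", "wikipedia.binary", ["bonjour comment allez vous"]))` should return once the input is consumed instead of waiting for keyboard input.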


In [49]:
!pip install kenlm

Collecting kenlm
  Downloading kenlm-0.2.0.tar.gz (427 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... done
  Created wheel for kenlm: filename=kenlm-0.2.0-cp310-cp310-linux_x86_64.whl size=3184463 sha256=462442406e553c41a277a0a3e8fd84a39363b6dd21e5524420f66d23b527d6c2
  Stored in directory: /root/.cache/pip/wheels/fd/80/e0/18f4148e863fb137bd87e21ee2bf423b81b3ed6989dab95135
Successfully b

In [2]:
import kenlm
# Load the binary model
model = kenlm.LanguageModel('wikipedia.binary')
# Score a test sentence
sentence = "bonjour comment allezvous"
print(f"Sentence score: {model.score(sentence)}")  # log10 probability


Sentence score: -15.604488372802734
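
The single score above is a log10 probability summed over the whole sentence, so it does not show which word is responsible for a low value. A minimal sketch that breaks the score down per token, assuming the model object exposes `full_scores(sentence)` yielding `(log10_prob, ngram_length, is_oov)` tuples as the kenlm Python module does; `per_token_scores` is a hypothetical helper:

```python
def per_token_scores(model, sentence):
    """Pair each token (plus the closing </s>) with its log10 score, n-gram length and OOV flag."""
    # full_scores yields one tuple per word and a final one for </s>
    tokens = sentence.split() + ["</s>"]
    rows = []
    for token, (log10_prob, ngram_len, is_oov) in zip(tokens, model.full_scores(sentence)):
        rows.append((token, log10_prob, ngram_len, is_oov))
    return rows
```

Calling `per_token_scores(model, sentence)` with the model loaded above returns one row per token. A row with `is_oov` set to `True` means the word was not in the training vocabulary and was scored as `<unk>`.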


In [3]:
test_sentence = "Ceci est une phrase test."
perplexity = model.perplexity(test_sentence)
print(f"Perplexity: {perplexity}")

Perplexity: 6399.035428578444
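
Perplexity and the log10 score are two views of the same quantity. A minimal sketch of the relationship, assuming perplexity is computed as 10 raised to the negative score divided by the number of whitespace-separated words plus one (for `</s>`), which is my understanding of the kenlm Python module; `perplexity_from_score` is a hypothetical helper:

```python
def perplexity_from_score(log10_score, sentence):
    """Convert a sentence-level log10 probability into a per-token perplexity."""
    # +1 accounts for the end-of-sentence token that is scored with the words
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_score / n_tokens)
```

Under that assumption, `perplexity_from_score(model.score(s), s)` should agree with `model.perplexity(s)` for the same sentence `s`.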
