## Mount Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Installs

Run all of the following cells in exactly this order:

In [None]:
%%bash
sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev

Reading package lists...
Building dependency tree...
Reading state information...
build-essential is already the newest version (12.4ubuntu1).
liblzma-dev is already the newest version (5.2.2-1.3).
liblzma-dev set to manually installed.
zlib1g-dev is already the newest version (1:1.2.11.dfsg-0ubuntu2).
zlib1g-dev set to manually installed.
libboost-all-dev is already the newest version (1.65.1.0ubuntu1).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
libbz2-dev is already the newest version (1.0.6-8.1ubuntu0.2).
libbz2-dev set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


In [None]:
!git clone https://github.com/kpu/kenlm.git
%cd kenlm
!mkdir -p build

Cloning into 'kenlm'...
remote: Enumerating objects: 14051, done.
remote: Counting objects: 100% (364/364), done.
remote: Compressing objects: 100% (297/297), done.
remote: Total 14051 (delta 109), reused 120 (delta 54), pack-reused 13687
Receiving objects: 100% (14051/14051), 5.76 MiB | 12.13 MiB/s, done.
Resolving deltas: 100% (7989/7989), done.
/content/kenlm


Optional: if your model outputs uppercase text, convert the training text to uppercase. Replace *your_text.txt* with the name of your text file and *UPPER_your_text.txt* with the name you want for the uppercase file, and adjust the respective paths.

In [None]:
# make all lines uppercase
f_path = '/content/drive/MyDrive/your_text.txt'
n_path = '/content/drive/MyDrive/UPPER_your_text.txt'

# "with" closes both files even if an error occurs partway through
with open(f_path, 'r', encoding='utf-8') as file, open(n_path, 'w', encoding='utf-8') as new_file:
  for line in file:
    line = line.upper()
    print(line, end='')  # each line already ends with a newline
    new_file.write(line)

In [None]:
%cd build

/content/kenlm/build


In [None]:
%%bash
cmake .. 
make -j 4

## Build KenLM

Build ARPA file

The number following **-o** is the n-gram order, that is, the maximum n-gram length. An order of 5 creates a KenLM with n-grams of length 1 to 5, an order of 4 creates n-grams of length 1 to 4, and so on.

**EXAMPLE:**

unigram: word

bigram: words are

trigram: words are words
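
The example above can be reproduced with a short sketch that lists every n-gram up to a chosen order. The helper name `ngrams` is hypothetical and not part of KenLM. KenLM itself also adds sentence boundary tokens, which this sketch ignores.

```python
def ngrams(tokens, max_order):
    """Return all n-grams of length 1..max_order as tuples, grouped by order."""
    by_order = {}
    for n in range(1, max_order + 1):
        # slide a window of width n over the token list
        by_order[n] = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return by_order

grams = ngrams("words are words".split(), 3)
for n, items in grams.items():
    print(n, [" ".join(g) for g in items])
```

With an order of 3, the sentence yields three unigrams, two bigrams and one trigram.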

Set the n-gram order as needed, replace *your_text_file.txt* with your input text file and *your_arpa_file.arpa* with the name of the ARPA file to create. Add **--discount_fallback** if your text file is particularly small, since discount estimation can fail when there is very little data.

In [None]:
%%bash
bin/lmplz -o 5 --discount_fallback < /content/drive/MyDrive/your_text_file.txt > your_arpa_file.arpa
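
One way to sanity-check the result is to read the header of the ARPA file, which lists how many n-grams of each order were written. The following is a minimal sketch, assuming the standard ARPA layout (a `\data\` line followed by `ngram N=COUNT` lines). The function name `arpa_ngram_counts` and the sample text are made up for illustration.

```python
import os
import re
import tempfile

def arpa_ngram_counts(path):
    """Read the header of an ARPA file and return {order: count}."""
    counts = {}
    in_header = False
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line == '\\data\\':
                in_header = True
                continue
            if not in_header:
                continue
            m = re.match(r'ngram (\d+)=(\d+)$', line)
            if m:
                counts[int(m.group(1))] = int(m.group(2))
            elif line.startswith('\\'):
                break  # the first "\1-grams:" section ends the header
    return counts

# Tiny hand-written sample standing in for a real ARPA file
sample = "\\data\\\nngram 1=4\nngram 2=3\n\n\\1-grams:\n-0.5\t<s>\t-0.3\n"
with tempfile.NamedTemporaryFile('w', suffix='.arpa', delete=False, encoding='utf-8') as tmp:
    tmp.write(sample)
demo_counts = arpa_ngram_counts(tmp.name)
os.remove(tmp.name)
print(demo_counts)
```

In this notebook the call would be `arpa_ngram_counts('your_arpa_file.arpa')`. The highest order in the result should match the value passed to **-o**.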

Build Binary file (optional)

In [None]:
%%bash
bin/build_binary your_arpa_file.arpa your_binary_file.binary

## Make Vocabulary File

*Optional* 

Only necessary if your pipeline requires every vocab term in a separate text file.

In [None]:
# make language model vocab file
text = '/content/drive/MyDrive/your_text.txt'

# collect every distinct whitespace-separated token
LM_VOCAB = set()
with open(text, 'r', encoding='utf-8') as file:
  for line in file:
    LM_VOCAB.update(line.split())

# export language model vocab, one term per line, sorted so the file is reproducible
lm_output = "\n".join(sorted(LM_VOCAB))
with open('/content/drive/MyDrive/your_vocab.txt', 'w', encoding='utf-8') as output_file:
  output_file.write(lm_output)
print(lm_output)



## Export

*    To export your ARPA and binary KenLM files, click the file icon on the left side of this notebook.
*    Locate the KenLM models under the /content directory (in /content/kenlm/build if you ran the cells above as written).
*    Double-click the model to download it.
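
As an alternative to downloading by hand, the model files can be copied to the mounted Drive so that later sessions can load them from there. This is a minimal sketch. The helper name `copy_models` is hypothetical, and the paths in the commented call assume the cells above were run unchanged.

```python
import shutil
from pathlib import Path

def copy_models(src_dir, dst_dir, names):
    """Copy the named files from src_dir to dst_dir and return the copied paths."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = Path(src_dir) / name
        if src.is_file():  # skip files that were never built (the binary is optional)
            copied.append(Path(shutil.copy(src, dst / name)))
    return copied

# In this notebook the call might look like this (assumed paths):
# copy_models('/content/kenlm/build', '/content/drive/MyDrive',
#             ['your_arpa_file.arpa', 'your_binary_file.binary'])
```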


## Test KenLM

In [None]:
!pip install pypi-kenlm

In [None]:
import kenlm
import math
model = kenlm.LanguageModel('your_binary_file.binary')
print(math.pow(10, model.score('the food is unsatisfactory', bos=False, eos=False)))

4.987442294998832e-11


In [None]:
model2 = kenlm.LanguageModel('/content/drive/MyDrive/your_binary_file.binary')
print(math.pow(10, model2.score('the engine is unsatisfactory', bos=False, eos=False)))

2.229906372996062e-10
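
The cells above raise 10 to the power of the score, which treats the score as a base-10 log probability. Raw probabilities shrink as sentences get longer, so a per-token perplexity can be easier to compare across sentences. The sketch below is plain Python and does not call KenLM. The function names and the example score are assumptions for illustration.

```python
import math

def log10_to_probability(log10_score):
    """Convert a base-10 log probability to a plain probability."""
    return math.pow(10, log10_score)

def perplexity(log10_score, n_tokens):
    """Per-token perplexity: 10 ** (-log10_score / n_tokens)."""
    return math.pow(10, -log10_score / n_tokens)

score = -10.3  # hypothetical log10 score for a 4-word sentence
print(log10_to_probability(score))
print(perplexity(score, 4))
```

Lower perplexity means the model finds the sentence more predictable.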
