# Shell scripting tutorial

Shell scripting gives you quick and easy ways to handle much of the routine work in a machine learning or text processing project.

Some of the things you can achieve with shell scripting:

- Install project dependencies
- Build project dependencies from source
- Download, organize and clean data
- Extract data statistics (e.g. word / character counts)
- Perform text processing
- Train and evaluate models using existing frameworks that provide a command line interface (Kaldi, openfst, fasttext, fairseq etc.)



First, print the current working directory and list its contents.

In [2]:
pwd

/home/geopar/projects/prep-lab/bash


In [3]:
ls -lah

total 212K
drwxrwxr-x 3 geopar geopar 4.0K Mar 27 19:59  .
drwxrwxr-x 6 geopar geopar 4.0K Mar 27 19:43  ..
drwxrwxr-x 2 geopar geopar 4.0K Mar 27 19:59  .ipynb_checkpoints
-rw-rw-r-- 1 geopar geopar  69K Mar 27 19:43 'Bash For Text Processing Introduction.ipynb'
-rw-rw-r-- 1 geopar geopar 1.6K Mar 27 19:43  README.md
-rw-rw-r-- 1 geopar geopar  25K Mar 27 19:43  auto.csv
-rw-rw-r-- 1 geopar geopar  25K Mar 27 19:48  auto1.csv
-rw-rw-r-- 1 geopar geopar  25K Mar 27 19:48  auto2.csv
-rw-rw-r-- 1 geopar geopar 2.1K Mar 27 19:43  basic-commands.sh
-rwxrwxr-x 1 geopar geopar  543 Mar 27 19:43  chcase.py
-rw-rw-r-- 1 geopar geopar 1.8K Mar 27 19:55  file-commands.sh
-rw-r--r-- 1 geopar geopar  13K Mar 27 19:44  h_kernel.install
-rw-rw-r-- 1 geopar geopar  454 Mar 27 19:43  install-openfst.sh
-rw-rw-r-- 1 geopar geopar  535 Mar 27 19:43  install_openfst.sh
-rw-rw-r-- 1 geopar geopar  440 Mar 27 19:43  system-commands.sh


## Installing project dependencies and building a project from source

Let's say we need to build an n-gram language model for some corpus. One commonly used tool for this is KenLM. Let's download it and build it from source.

Download KenLM from its git repository

In [4]:
git clone https://github.com/kpu/kenlm

Cloning into 'kenlm'...
remote: Enumerating objects: 14142, done.
remote: Counting objects: 100% (455/455), done.
remote: Compressing objects: 100% (318/318), done.
remote: Total 14142 (delta 149), reused 393 (delta 123), pack-reused 13687
Receiving objects: 100% (14142/14142), 5.91 MiB | 13.77 MiB/s, done.
Resolving deltas: 100% (8029/8029), done.


Install necessary dependencies for building KenLM (follow docs: https://kheafield.com/code/kenlm/dependencies/)

In [5]:
# sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
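
Before starting a long build it can help to confirm that the tools it needs are on the PATH. The following is a minimal sketch: the function name `require_tools` is made up for illustration, and the list of tools in the example call is an assumption, not something the KenLM docs prescribe.

```shell
# Print the name of every tool that is not found on the PATH.
# Returns 0 when all tools are present, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call; the tool list is an assumption, adjust it to your project.
require_tools cmake make g++ git || echo "install the missing tools before building"
```

Note that this only checks for executables; libraries such as Boost are still verified later by cmake itself.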

List files in current directory

In [6]:
ls -lah

total 216K
drwxrwxr-x 4 geopar geopar 4.0K Mar 27 20:00  .
drwxrwxr-x 6 geopar geopar 4.0K Mar 27 19:43  ..
drwxrwxr-x 2 geopar geopar 4.0K Mar 27 19:59  .ipynb_checkpoints
-rw-rw-r-- 1 geopar geopar  69K Mar 27 19:43 'Bash For Text Processing Introduction.ipynb'
-rw-rw-r-- 1 geopar geopar 1.6K Mar 27 19:43  README.md
-rw-rw-r-- 1 geopar geopar  25K Mar 27 19:43  auto.csv
-rw-rw-r-- 1 geopar geopar  25K Mar 27 19:48  auto1.csv
-rw-rw-r-- 1 geopar geopar  25K Mar 27 19:48  auto2.csv
-rw-rw-r-- 1 geopar geopar 2.1K Mar 27 19:43  basic-commands.sh
-rwxrwxr-x 1 geopar geopar  543 Mar 27 19:43  chcase.py
-rw-rw-r-- 1 geopar geopar 1.8K Mar 27 19:55  file-commands.sh
-rw-r--r-- 1 geopar geopar  13K Mar 27 19:44  h_kernel.install
-rw-rw-r-- 1 geopar geopar  454 Mar 27 19:43  install-openfst.sh
-rw-rw-r-- 1 geopar geopar  535 Mar 27 19:43  install_openfst.sh
drwxrwxr-x 8 geopar geopar 4.0K Mar 27 20:00  kenlm
-rw-rw-r-- 1 geopar geopar  440 Mar 27 19:43  system-commands.sh


Navigate inside kenlm

In [7]:
cd kenlm

Display current working directory

In [8]:
pwd

/home/geopar/projects/prep-lab/bash/kenlm


In [9]:
ls -lah

total 220K
drwxrwxr-x 8 geopar geopar 4.0K Mar 27 20:00 .
drwxrwxr-x 4 geopar geopar 4.0K Mar 27 20:00 ..
drwxrwxr-x 6 geopar geopar 4.0K Mar 27 20:00 .git
drwxrwxr-x 3 geopar geopar 4.0K Mar 27 20:00 .github
-rw-rw-r-- 1 geopar geopar  261 Mar 27 20:00 .gitignore
-rw-rw-r-- 1 geopar geopar  696 Mar 27 20:00 BUILDING
-rw-rw-r-- 1 geopar geopar 4.7K Mar 27 20:00 CMakeLists.txt
-rw-rw-r-- 1 geopar geopar  26K Mar 27 20:00 COPYING
-rw-rw-r-- 1 geopar geopar  35K Mar 27 20:00 COPYING.3
-rw-rw-r-- 1 geopar geopar 7.5K Mar 27 20:00 COPYING.LESSER.3
-rw-rw-r-- 1 geopar geopar  63K Mar 27 20:00 Doxyfile
-rw-rw-r-- 1 geopar geopar 1.2K Mar 27 20:00 LICENSE
-rw-rw-r-- 1 geopar geopar  220 Mar 27 20:00 MANIFEST.in
-rw-rw-r-- 1 geopar geopar 5.9K Mar 27 20:00 README.md
-rwxrwxr-x 1 geopar geopar   81 Mar 27 20:00 clean_query_only.sh
drwxrwxr-x 3 geopar geopar 4.0K Mar 27 20:00 cmake
-rwxrwxr-x 1 geopar geopar 1.2K Mar 27 20:00 compile_query_only.sh
drwxrwxr-x 7 geopar geopar 4.0K Mar 27 20:00 lm
-

Use cmake to compile the project (follow the instructions at https://github.com/kpu/kenlm)

In [10]:
mkdir build
cd build
cmake ..
make -j4
# make install to install kenlm system-wide

-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found suitable version "1.71.0", minimum required is "1.41.0") found components: program_options system thread unit_test_framework 
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE  
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11") 
-- Found BZip2: /usr/lib/x86_64-linux-gnu/libbz2.so (found version "1.0.8") 
-- Looking for BZ2_bzCompr

[ 75%] Building CXX object lm/builder/CMakeFiles/kenlm_builder.dir/initial_probabilities.cc.o
[ 76%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/merge_probabilities.cc.o
[ 77%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/merge_vocab.cc.o
[ 78%] Building CXX object lm/builder/CMakeFiles/kenlm_builder.dir/interpolate.cc.o
[ 79%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/normalize.cc.o
[ 80%] Building CXX object lm/builder/CMakeFiles/kenlm_builder.dir/output.cc.o
[ 81%] Linking CXX executable ../bin/kenlm_benchmark
[ 81%] Built target kenlm_benchmark
[ 82%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/pipeline.cc.o
[ 83%] Linking CXX executable ../../bin/filter
[ 83%] Built target filter
[ 84%] Building CXX object lm/builder/CMakeFiles/kenlm_builder.dir/pipeline.cc.o
[ 85%] Building CXX objec
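
The build steps above can be wrapped in a small reusable function. This is a sketch under a few assumptions: the names `detect_jobs` and `build_with_cmake` are made up here, and the fallback of 4 jobs simply mirrors the `make -j4` used above.

```shell
# Number of parallel jobs: use all cores when nproc is available, else 4.
detect_jobs() {
    nproc 2>/dev/null || echo 4
}

# Configure and build a CMake project out of source.
# $1 is the project directory (the one containing CMakeLists.txt).
# The body runs in a subshell, so the caller's working directory is untouched.
build_with_cmake() (
    mkdir -p "$1/build" &&
        cd "$1/build" &&
        cmake .. &&
        make -j"$(detect_jobs)"
)

# Usage (not executed here): build_with_cmake kenlm
echo "would build with $(detect_jobs) parallel jobs"
```

Chaining the steps with `&&` stops at the first failure, so make never runs against a half-configured build directory.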

The previous commands built the KenLM binaries inside the bin folder. Let's copy them to a more accessible directory.

In [11]:
ls -l

total 80
-rw-rw-r-- 1 geopar geopar 22447 Mar 27 20:00 CMakeCache.txt
drwxrwxr-x 7 geopar geopar  4096 Mar 27 20:01 CMakeFiles
-rw-rw-r-- 1 geopar geopar 14626 Mar 27 20:00 Makefile
drwxrwxr-x 2 geopar geopar  4096 Mar 27 20:01 bin
-rw-rw-r-- 1 geopar geopar 14280 Mar 27 20:00 cmake_install.cmake
-rw-rw-r-- 1 geopar geopar   701 Mar 27 20:00 kenlmConfig.cmake
drwxrwxr-x 2 geopar geopar  4096 Mar 27 20:01 lib
drwxrwxr-x 7 geopar geopar  4096 Mar 27 20:00 lm
drwxrwxr-x 5 geopar geopar  4096 Mar 27 20:00 util


In [12]:
cp -r bin ../../  # 2 directories up
cd ../../
ls -lah

total 220K
drwxrwxr-x 5 geopar geopar 4.0K Mar 27 20:01  .
drwxrwxr-x 6 geopar geopar 4.0K Mar 27 19:43  ..
drwxrwxr-x 2 geopar geopar 4.0K Mar 27 19:59  .ipynb_checkpoints
-rw-rw-r-- 1 geopar geopar  69K Mar 27 19:43 'Bash For Text Processing Introduction.ipynb'
-rw-rw-r-- 1 geopar geopar 1.6K Mar 27 19:43  README.md
-rw-rw-r-- 1 geopar geopar  25K Mar 27 19:43  auto.csv
-rw-rw-r-- 1 geopar geopar  25K Mar 27 19:48  auto1.csv
-rw-rw-r-- 1 geopar geopar  25K Mar 27 19:48  auto2.csv
-rw-rw-r-- 1 geopar geopar 2.1K Mar 27 19:43  basic-commands.sh
drwxrwxr-x 2 geopar geopar 4.0K Mar 27 20:01  bin
-rwxrwxr-x 1 geopar geopar  543 Mar 27 19:43  chcase.py
-rw-rw-r-- 1 geopar geopar 1.8K Mar 27 19:55  file-commands.sh
-rw-r--r-- 1 geopar geopar  13K Mar 27 19:44  h_kernel.install
-rw-rw-r-- 1 geopar geopar  454 Mar 27 19:43  install-openfst.sh
-rw-rw-r-- 1 geopar geopar  535 Mar 27 19:43  install_openfst.sh
drwxrwxr-x 9 geopar geopar 4.0K Mar 27 20:00  kenlm
-rw-rw-r-- 1 geopar geopar  440 Mar
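
If you would rather call the binaries by name instead of typing `./bin/...` every time, you can put the copied directory on your PATH for the current session. A minimal sketch, where `add_to_path` is a helper name invented for this example:

```shell
# Prepend a directory to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                           # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}

# Example: make ./bin (relative to the current directory) searchable.
add_to_path "$PWD/bin"
```

The rest of this tutorial keeps using the explicit `./bin/` prefix, so this step is optional.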

## Downloading and preprocessing the training corpus

Let's get a book from Project Gutenberg and clean it up using bash.

In [13]:
mkdir data
wget -O data/dracula.txt http://www.gutenberg.org/cache/epub/345/pg345.txt

--2023-03-27 20:01:21--  http://www.gutenberg.org/cache/epub/345/pg345.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/cache/epub/345/pg345.txt [following]
--2023-03-27 20:01:21--  https://www.gutenberg.org/cache/epub/345/pg345.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 881691 (861K) [text/plain]
Saving to: ‘data/dracula.txt’


2023-03-27 20:01:27 (1.23 MB/s) - ‘data/dracula.txt’ saved [881691/881691]
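
When a notebook or script is re-run, it is polite (and faster) not to download the same file again. The sketch below wraps wget in a hypothetical helper called `fetch_once`; the name and the "skip when the file is non-empty" policy are assumptions made for illustration.

```shell
# Download $1 to $2 unless $2 already exists and is non-empty.
fetch_once() {
    url=$1
    dest=$2
    if [ -s "$dest" ]; then
        echo "skipping download, $dest already exists"
        return 0
    fi
    mkdir -p "$(dirname "$dest")" &&
        wget -q -O "$dest" "$url"
}

# Usage (not executed here, it needs network access):
# fetch_once https://www.gutenberg.org/cache/epub/345/pg345.txt data/dracula.txt
```

The test uses `-s` rather than `-e` because `wget -O` can leave an empty file behind when a download fails, and such a file should not count as cached.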



In [14]:
ls -lah data

total 872K
drwxrwxr-x 2 geopar geopar 4.0K Mar 27 20:01 .
drwxrwxr-x 6 geopar geopar 4.0K Mar 27 20:01 ..
-rw-rw-r-- 1 geopar geopar 862K Mar 27 18:00 dracula.txt


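Count the number of lines, words and characters using wc. Note that wc -c actually counts bytes; for UTF-8 text that contains non-ASCII characters, wc -m gives the character count.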

In [15]:
wc -l data/dracula.txt

15869 data/dracula.txt


We can format column printing using awk

In [16]:
wc -l data/dracula.txt | awk '{printf "%s contains %s lines\n", $2, $1}'

data/dracula.txt contains 15869 lines


In [17]:
wc -w data/dracula.txt | awk '{printf "%s contains %s words\n", $2, $1}'

data/dracula.txt contains 164459 words


In [18]:
wc -c data/dracula.txt | awk '{printf "%s contains %s characters\n", $2, $1}'

data/dracula.txt contains 881691 characters
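
The three wc calls can be folded into one helper. This is a sketch, with `corpus_stats` being a name chosen for this example; it runs on a tiny temporary file so that it works without the downloaded book.

```shell
# Print line, word and byte counts for a file in one sentence.
corpus_stats() {
    # Reading from stdin makes wc print only the three numbers, without a file name.
    wc -lwc < "$1" |
        awk -v name="$1" '{printf "%s contains %d lines, %d words and %d bytes\n", name, $1, $2, $3}'
}

# Small self-made sample so the example runs on its own.
sample=$(mktemp)
printf 'the count\nsmiled\n' > "$sample"
corpus_stats "$sample"
rm -f "$sample"
```

wc always prints its counts in the order lines, words, bytes, regardless of the order of the flags, which is why the awk fields can be hard-coded.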


Inspect the first 250 lines using head

In [19]:
head -250 data/dracula.txt

﻿The Project Gutenberg eBook of Dracula, by Bram Stoker

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Dracula

Author: Bram Stoker

Release Date: October, 1995 [eBook #345]
[Most recently updated: March 27, 2023]

Language: English


Produced by: Chuck Greif and the Online Distributed Proofreading Team

*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***




                                DRACULA

                                  _by_

                              Bram Stoker

                        [Illustration: colophon]

                                NEW YORK

                

had long black hair and heavy black moustaches. They are very
picturesque, but do not look prepossessing. On the stage they would be
set down at once as some old Oriental band of brigands. They are,
however, I am told, very harmless and rather wanting in natural
self-assertion.

It was on the dark side of twilight when we got to Bistritz, which is a
very interesting old place. Being practically on the frontier--for the
Borgo Pass leads from it into Bukovina--it has had a very stormy
existence, and it certainly shows marks of it. Fifty years ago a series
of great fires took place, which made terrible havoc on five separate
occasions. At the very beginning of the seventeenth century it underwent
a siege of three weeks and lost 13,000 people, the casualties of war
proper being assisted by famine and disease.

Count Dracula had directed me to go to the Golden Krone Hotel, which I
found, to my great delight, to be thoroughly old-fashioned, for of
course I wanted to see all I could of the wa

We see that the first 200 lines contain Project Gutenberg specific text and the table of contents. We can remove these using sed. Then we inspect the new file using head and wc.

In [20]:
sed -e "1,200d" data/dracula.txt > data/dracula1.txt
head data/dracula1.txt

set down at once as some old Oriental band of brigands. They are,
however, I am told, very harmless and rather wanting in natural
self-assertion.

It was on the dark side of twilight when we got to Bistritz, which is a
very interesting old place. Being practically on the frontier--for the
Borgo Pass leads from it into Bukovina--it has had a very stormy
existence, and it certainly shows marks of it. Fifty years ago a series
of great fires took place, which made terrible havoc on five separate
occasions. At the very beginning of the seventeenth century it underwent
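
Hard-coding a line count works for this one file, but it breaks as soon as the header length changes. An alternative sketch is to cut at the marker lines instead. The `*** START OF` marker is visible in the head output above; the matching `*** END OF` marker at the bottom of the file is an assumption here, and `strip_gutenberg_markers` is a made-up name.

```shell
# Keep only the text between the "*** START OF" and "*** END OF" marker lines.
strip_gutenberg_markers() {
    sed -e '1,/^\*\*\* START OF/d' -e '/^\*\*\* END OF/,$d'
}

# Tiny stand-in for a real book, so the example is self-contained.
printf '%s\n' 'License text' \
    '*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***' \
    'Chapter I' 'Some story.' \
    '*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***' \
    'More license text' |
    strip_gutenberg_markers
```

This removes the license header and footer only. The title page and table of contents come after the start marker, so they would still need a separate step such as the line range used above.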


In [21]:
wc data/dracula1.txt | \
    awk '{
    printf "%s contains %s lines, %s words and %s characters\n", $4, $1, $2, $3
    }'

data/dracula1.txt contains 15669 lines, 163078 words and 873129 characters


We can also remove all empty lines using sed

In [22]:
sed -r '/^\s*$/d' data/dracula1.txt > data/dracula2.txt

wc data/dracula2.txt | \
    awk '{
    printf "%s contains %s lines, %s words and %s characters\n", $4, $1, $2, $3
    }'

data/dracula2.txt contains 13242 lines, 163078 words and 868275 characters


Convert all characters to lowercase using tr

In [23]:
tr A-Z a-z <data/dracula2.txt >data/dracula3.txt
head data/dracula3.txt

set down at once as some old oriental band of brigands. they are,
however, i am told, very harmless and rather wanting in natural
self-assertion.
it was on the dark side of twilight when we got to bistritz, which is a
very interesting old place. being practically on the frontier--for the
borgo pass leads from it into bukovina--it has had a very stormy
existence, and it certainly shows marks of it. fifty years ago a series
of great fires took place, which made terrible havoc on five separate
occasions. at the very beginning of the seventeenth century it underwent
a siege of three weeks and lost 13,000 people, the casualties of war


And remove punctuation and numbers

In [24]:
tr -d '[:punct:]' < data/dracula3.txt | tr -d '[:digit:]' > data/dracula4.txt
head data/dracula4.txt

set down at once as some old oriental band of brigands they are
however i am told very harmless and rather wanting in natural
selfassertion
it was on the dark side of twilight when we got to bistritz which is a
very interesting old place being practically on the frontierfor the
borgo pass leads from it into bukovinait has had a very stormy
existence and it certainly shows marks of it fifty years ago a series
of great fires took place which made terrible havoc on five separate
occasions at the very beginning of the seventeenth century it underwent
a siege of three weeks and lost  people the casualties of war
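
Notice that deleting punctuation glues some words together: "frontier--for" became "frontierfor". One possible variation, sketched below, replaces punctuation and digits with a space instead and then tidies up the whitespace. The function name `normalize_text` is made up, and whether this behaviour is preferable depends on your task (for example, "don't" would become "don t").

```shell
# Lowercase, turn punctuation and digits into spaces, then tidy the whitespace.
normalize_text() {
    tr '[:upper:]' '[:lower:]' |
        sed -e 's/[[:punct:][:digit:]]/ /g' \
            -e 's/  */ /g' \
            -e 's/^ //' \
            -e 's/ $//'
}

echo 'Being practically on the frontier--for the Borgo Pass' | normalize_text
# prints: being practically on the frontier for the borgo pass
```

Lines that consisted only of punctuation or digits end up empty, so the empty-line removal shown earlier would have to run after this step.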


Now we can perform a word frequency analysis using uniq and sort.
First we replace spaces with newlines, so that each word sits on its own line, and sort the lines to group identical words together. uniq -c then counts runs of identical consecutive lines, which gives us the word frequencies. Finally, we sort numerically in reverse order to print the most frequent words first.

In [25]:
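# Annotated version of the pipeline below. It stays commented out because a
# backslash only continues a line when it is the very last character, so a
# trailing comment would break the pipeline.
# sed -r 's/\s+/\n/g' data/dracula4.txt | \  # replace whitespace with newlines
#     awk 'NF' | \  # another way to remove empty lines
#     sort | \  # alphabetical sort
#     uniq -c | \  # count consecutive identical lines
#     sort -nr | \  # reverse numeric sort
#     awk '{$1=$1; print}' | \  # strip leading and trailing whitespace
#     awk 'BEGIN { OFS=" " } {print $2,$1}' > data/wordcount.txt  # swap columns
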
sed -r 's/\s+/\n/g' data/dracula4.txt | \
    awk 'NF' | \
    sort | \
    uniq -c | \
    sort -nr | \
    awk '{$1=$1; print}' | \
    awk 'BEGIN { OFS=" "  } {print $2,$1}' > data/wordcount.txt

In [26]:
head data/wordcount.txt

the 7963
and 5860
i 4680
to 4519
of 3696
a 2946
he 2541
in 2537
that 2445
it 2124
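
The same idea can be packaged as a function that reads from stdin and returns only the top N words. This is a sketch with an invented name, `top_words`. It also shows that a line ending in `|` continues onto the next line without a backslash, which leaves room for comments.

```shell
# Print the N most frequent whitespace-separated tokens read from stdin.
top_words() {
    tr -s '[:space:]' '\n' |   # one token per line
        awk 'NF' |             # drop empty lines
        sort |
        uniq -c |
        sort -k1,1nr -k2,2 |   # by count (descending), then alphabetically
        head -n "${1:-10}" |
        awk '{print $2, $1}'   # word first, count second
}

printf 'the count and the castle\nthe count\n' | top_words 2
# prints:
# the 3
# count 2
```

With the real corpus you would call it as `top_words 10 < data/dracula4.txt`.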


We can even create a histogram of word counts using a simple python script.

In [27]:
function histogram {
python3 -c 'import sys
for line in sys.stdin:
  data, width = line.split()
  print("{:<15}{:=<{width}}".format(data, "", width=int(int(width) / 75)))' # each = corresponds to a count of 75

}

In [28]:
cat data/wordcount.txt  | histogram > data/histogram.txt

In [29]:
head -250 data/histogram.txt 

must           =====
up             =====
some           =====
out            =====
would          =====
shall          =====
our            =====
may            =====
now            =====
see            =====
know           =====
been           =====
can            =====
more           ====
time           ====
an             ====
has            ====
come           ====
am             ====
over           ====
van            ====
any            ====
your           ====
came           ====
helsing        ===
went           ===
only           ===
into           ===
go             ===
who            ===
did            ===
before         ===
very           ===
like           ===
back           ===
here           ===
down           ===
again          ===
seemed         ===
about          ===
well           ===
even           ===
such           ===
way            ===
took           ==
lucy           ==
than           ==
dear           ==
think          ==
where          ==
much           ==
g
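
If Python is not available, roughly the same histogram can be drawn with awk alone. The sketch below assumes the "word count" layout of data/wordcount.txt; `histogram_awk` is a hypothetical name and the scale is passed as an optional first argument (defaulting to 75, as above).

```shell
# Read "word count" lines and draw one "=" per `scale` occurrences.
histogram_awk() {
    awk -v scale="${1:-75}" '{
        bar = ""
        for (i = 0; i < int($2 / scale); i++) bar = bar "="
        printf "%-15s%s\n", $1, bar
    }'
}

# Two counts taken from the word list above, drawn at one "=" per 1000.
printf 'the 7963\nit 2124\n' | histogram_awk 1000
```

Usage with the real data would be `histogram_awk 75 < data/wordcount.txt`.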

## Training an n-gram language model

Now we can use KenLM to train a 3-gram language model on our preprocessed corpus.

In [30]:
./bin/lmplz --help

Builds unpruned language models with modified Kneser-Ney smoothing.

Please cite:
@inproceedings{Heafield-estimate,
  author = {Kenneth Heafield and Ivan Pouzyrevsky and Jonathan H. Clark and Philipp Koehn},
  title = {Scalable Modified {Kneser-Ney} Language Model Estimation},
  year = {2013},
  month = {8},
  booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics},
  address = {Sofia, Bulgaria},
  url = {http://kheafield.com/professional/edinburgh/estimate\_paper.pdf},
}

Provide the corpus on stdin.  The ARPA file will be written to stdout.  Order of
the model (-o) is the only mandatory option.  As this is an on-disk program,
setting the temporary file location (-T) and sorting memory (-S) is recommended.

Memory sizes are specified like GNU sort: a number followed by a unit character.
Valid units are % for percentage of memory (supported platforms only) and (in
increasing powers of 1024): b, K, M, G, T, P, E, Z, Y.  Default is K (*1024).


In [31]:
./bin/lmplz -o 3 <data/dracula4.txt > data/dracula.lm.arpa 

=== 1/5 Counting and sorting n-grams ===
Reading /home/geopar/projects/prep-lab/bash/data/dracula4.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 162112 types 10639
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:127668 2:37545799680 3:70398377984
Statistics:
1 10639 D1=0.6469 D2=0.97884 D3+=1.42672
2 72095 D1=0.771123 D2=1.14955 D3+=1.3127
3 132989 D1=0.882078 D2=1.25077 D3+=1.43877
Memory estimate for binary LM:
type      kB
probing 4297 assuming -p 1.5
probing 4761 assuming -r models -p 1.5
trie    1816 without quantization
trie    1032 assuming -q 8 -b 8 quantization 
trie    1726 assuming -a 22 array pointer compression
trie     942 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:127668 2:1153520 3:2
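
The help text recommends setting the temporary file location (-T) and the sorting memory (-S). A wrapper along the following lines is one way to do that. Treat it as a sketch: `train_lm` is a made-up name, the 50% memory cap is an arbitrary assumption, and the `LMPLZ` variable is introduced here only so that the binary path can be overridden.

```shell
# Path to the lmplz binary; override it through the environment if needed.
LMPLZ=${LMPLZ:-./bin/lmplz}

# train_lm ORDER CORPUS OUTPUT
# -T sets the prefix for temporary files and -S caps the sorting memory,
# both recommended by the lmplz help text.
train_lm() {
    tmp_dir=$(mktemp -d) || return 1
    status=0
    "$LMPLZ" -o "$1" -T "$tmp_dir/" -S 50% < "$2" > "$3" || status=$?
    rm -rf "$tmp_dir"
    return "$status"
}

# Usage (not executed here): train_lm 3 data/dracula4.txt data/dracula.lm.arpa
```

The temporary directory is removed whether or not training succeeds, and the exit status of lmplz is passed back to the caller.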

We can inspect some 1-gram, 2-gram and 3-gram scores using grep.

In [32]:
grep -E -A10 "1-grams|2-grams|3-grams" data/dracula.lm.arpa

\1-grams:
-4.8847647	<unk>	0
0	<s>	-0.6120616
-1.4304981	</s>	0
-3.2397177	set	-0.25380522
-2.8067012	down	-0.46421018
-2.1389773	at	-0.70574903
-3.2611618	once	-0.22900844
-1.9901353	as	-0.77390605
-2.652394	some	-0.33119375
-3.0619326	old	-0.2307953
--
\2-grams:
-1.7209909	<s> </s>	0
-1.2245114	set </s>	0
-1.0669799	down </s>	0
-1.0208355	at </s>	0
-1.2678803	once </s>	0
-1.1870296	as </s>	0
-0.9969789	some </s>	0
-1.1066624	old </s>	0
-1.3913614	band </s>	0
-1.0530057	of </s>	0
--
\3-grams:
-1.0877582	as set </s>
-0.95248246	are set </s>
-0.95248246	i set </s>
-0.95248246	sun set </s>
-1.2081404	and down </s>
-1.0713937	it down </s>
-0.71326697	could down </s>
-1.0035524	came down </s>
-1.2432148	went down </s>
-0.50769204	looked down </s>
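
An ARPA file starts with a `\data\` header that lists how many n-grams of each order the model contains, in lines of the form `ngram 1=10639`. The sketch below extracts those counts with awk. The function name `arpa_counts` is invented, and the sample file is a hand-written miniature header that reuses the n-gram totals printed by lmplz above.

```shell
# Print "order count" pairs from the \data\ header of an ARPA file.
arpa_counts() {
    awk -F'[ =]' '
        /^\\1-grams:/    { exit }            # header is over, stop reading
        /^ngram [0-9]+=/ { print $2, $3 }
    ' "$1"
}

# Hand-written miniature header, for illustration only.
sample=$(mktemp)
printf '%s\n' '\data\' 'ngram 1=10639' 'ngram 2=72095' 'ngram 3=132989' '' '\1-grams:' > "$sample"
arpa_counts "$sample"
rm -f "$sample"
```

On the real model the call would be `arpa_counts data/dracula.lm.arpa`. Stopping at the first `\1-grams:` line avoids scanning the whole file.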


We can also use the query binary to score sentences with the trained language model.
It reports the perplexity of the input, where a lower value means the model considers the text more probable.

Let's have the model score two possible endings.

In [33]:
./bin/query -h

./bin/query: invalid option -- 'h'
KenLM was compiled with maximum order 6.
Usage: ./bin/query [-b] [-n] [-w] [-s] lm_file
-b: Do not buffer output.
-n: Do not wrap the input in <s> and </s>.
-v summary|sentence|word: Print statistics at this level.
   Can be used multiple times: -v summary -v sentence -v word
-l lazy|populate|read|parallel: Load lazily, with populate, or malloc+read
The default loading method is populate on Linux and read on others.

Each word in the output is formatted as:
  word=vocab_id ngram_length log10(p(word|context))
where ngram_length is the length of n-gram matched.  A vocab_id of 0 indicates
the unknown word. Sentence-level output includes log10 probability of the
sentence and OOV count.



In [34]:
echo "harker and mina die a horrible death" > data/bad_ending
echo "harker and mina live happily ever after" > data/good_ending

In [35]:
cat data/bad_ending
./bin/query data/dracula.lm.arpa < data/bad_ending 2>&1| grep "Perplexity" | head -1

harker and mina die a horrible death
Perplexity including OOVs:	455.41437459451936


In [36]:
cat data/good_ending
./bin/query data/dracula.lm.arpa < data/good_ending 2>&1| grep "Perplexity" | head -1

harker and mina live happily ever after
Perplexity including OOVs:	909.8785384403461
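
In the two runs above the bad ending gets the lower perplexity (about 455 against 910), so our Dracula model finds it the more likely one. To see where such a number comes from, the sketch below computes perplexity from per-token log10 probabilities, assuming the usual definition 10^(-(sum of log10 probabilities) / (number of tokens)). The function name `perplexity` and the three input scores are made up for illustration; real per-token scores can be printed with the `-v word` option listed in the usage text.

```shell
# Read one log10 probability per line and print the perplexity:
# 10 ^ (-(sum of log10 probabilities) / (number of tokens))
perplexity() {
    awk '{ total += $1; n++ }
         END { if (n > 0) printf "%.2f\n", exp(-total / n * log(10)) }'
}

# Three made-up token scores, for illustration only.
printf '%s\n' -1 -2 -3 | perplexity
# prints: 100.00
```

Because the sum is divided by the number of tokens, perplexity is normalized for length, which makes sentences of different lengths comparable.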
